Dark data in Big Data & Analytics

By Srikanth RP On Apr 30, 2018

By Arvind Purushothaman, VP – Data & Analytics, Virtusa Corp

Dark data is data that enterprises collect but typically do not use for decision making. It could be data from sensors, logs, or other transactional systems that is readily available but ignored. It also makes up the largest share of the total volume of big data that organizations collect in a year. Dark data usually goes unanalysed because enterprises lack the bandwidth or technical capability, or do not believe the data will add value. All are valid reasons for organizations to ignore this data. However, with today’s advances in technology and the ability to source, ingest, store and analyse large volumes of data and correlate them with other data sources, it has become important for organizations to recognize this largely untapped resource.

In the past, enterprises assumed that all of their data could be systematically converged into a data warehouse and then identified, reconciled, rationalized, generally tidied up and reported on. However, this approach limits the ability to analyse dark data, and often there is no project driving the need to do so.

Yet evidence now suggests that 90 percent of all data across the enterprise is dark. Given that enterprises have moved on to ingesting and storing large volumes of data in a Data Lake, it makes sense to store this data and tag it as it is being stored. Extracting metadata from this data will be key to exploiting it. The data can then be profiled and explored using the many tools available in the market, including visualisation products. Advances in Machine Learning and Cognitive Computing, combined with cheaper storage and increased processing power, have opened up the possibility of leveraging “dark” data intelligently.

Dark data can be both structured and unstructured. Examples of structured data are an organization’s contracts and reports, which become dark in due course. Unstructured data can be bits of personally identifiable information such as birth dates, Aadhaar numbers and billing details. In the past, this type of data has remained “dark”. However, with advances in Machine Learning, it can be extracted in an automated manner and connected with other data attributes to provide a more complete view. Another typical example is geolocation data, which is very valuable when used in real time but decays rapidly. Even after it has served its real-time purpose, however, the accumulated historical data can be used with Machine Learning to predict outcomes.

Other examples of data that was considered “dark” in the past include data from logs, sensors, emails and voice transcripts. It would be used only for “troubleshooting” purposes rather than for intelligent decision making. With the ability to convert voice to text and use the text to glean intelligence, many use cases have emerged that take advantage of traditionally “dark” data. Another example is CCTV logs, which remained largely unused. Now they can be used to understand traffic patterns, identify criminals and so on.

Estimates from IDC indicate the universe of data will touch 44 ZB (zettabytes) by 2020. This explosion of data will be influenced by many new data sources such as the Internet of Things (IoT). A significant percentage of the data will be “dark” unless we can illuminate it using new technologies and processes.

The first step is to make all data, including “dark” data, available for exploration. With Data Lake platforms providing cost-effective solutions, both on-premise and in the cloud, it is much easier to bring the data to a common platform. The next step is to catalogue the data, extract metadata, and do a preliminary check for data quality. Modern Data Management and Data Visualization tools provide the ability to catalogue and visually explore the data. This can help determine whether the data should be illuminated any further to remove the noise. If the decision is made to move forward, a business case with a specific goal should be framed, followed by a more detailed analysis and technology implementation.
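The catalogue-and-check step can be illustrated with a small sketch. It assumes the dataset is plain CSV text, and the names `CatalogueEntry` and `profile_csv` are hypothetical, invented for this illustration rather than taken from any data management product. The sketch records basic metadata (name, row count, columns) and uses the share of empty values per column as a crude first data-quality signal.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    """Hypothetical catalogue record for one dataset landed in the lake."""
    name: str
    row_count: int
    columns: list[str]
    # Share of empty values per column: a crude first data-quality signal.
    missing_ratio: dict[str, float] = field(default_factory=dict)


def profile_csv(name: str, text: str) -> CatalogueEntry:
    """Extract basic metadata and a missing-value profile from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    columns = list(reader.fieldnames or [])
    missing = {col: 0 for col in columns}
    row_count = 0
    for row in reader:
        row_count += 1
        for col in columns:
            value = row.get(col)
            # Treat absent and blank cells alike as missing.
            if value is None or value.strip() == "":
                missing[col] += 1
    ratios = {
        col: (missing[col] / row_count if row_count else 0.0)
        for col in columns
    }
    return CatalogueEntry(name, row_count, columns, ratios)


# Made-up sample standing in for a sensor log that was never analysed.
sample = "sensor_id,reading,unit\nA1,20.5,C\nA2,,C\nA3,19.0,\nA4,,C\n"
entry = profile_csv("sensor_log.csv", sample)
print(entry.row_count, entry.columns, entry.missing_ratio)
```

A column that is half empty, like `reading` in this made-up sample, is the kind of early finding that helps decide whether a dataset deserves a business case or should stay dark.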

Given the recent concerns over data privacy and the GDPR coming into play, it will be important to ensure compliance when we source new data. It also pays to understand what data you currently hold on your servers. There may be elements of data that must not be stored, such as Personally Identifiable Information (PII), so it makes sense to scan the archives to ensure there are no compliance issues.
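As a rough illustration of what scanning an archive for PII might look like, the sketch below runs a few regular expressions over a piece of text. The patterns, the `PII_PATTERNS` table and the `scan_for_pii` function are assumptions made for this example only. They will produce false positives and miss many real cases, so they are a starting point for triage rather than a compliance-grade detector.

```python
import re

# Illustrative patterns only (assumptions, not a compliance-grade detector).
PII_PATTERNS = {
    # Something shaped like an email address.
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    # Twelve digits, optionally grouped 4-4-4, as national ID numbers
    # such as Aadhaar are often written.
    "twelve_digit_id": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
    # A day/month/year style date that could be a birth date.
    "date": re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{4}\b"),
}


def scan_for_pii(text: str) -> dict[str, list[str]]:
    """Return the matches found for each hypothetical PII category."""
    findings = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            findings[label] = matches
    return findings


# Made-up archived record used purely for demonstration.
archived_note = (
    "Customer born 14/08/1985, contact ravi@example.com, "
    "ID on file 1234 5678 9012. Order total was 4500."
)
print(scan_for_pii(archived_note))
```

In practice a scan like this would only flag candidate records for human review; deciding what may lawfully be kept remains a question for the compliance team.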

Advances in Artificial Intelligence (AI) mean we now have new ways of unravelling the secrets of “dark” data but, as with any business process or tool, applying them incorrectly can produce incorrect results or invite the wrath of regulators. With all the data now available, and especially given the vastness of “dark” data in the enterprise, insights need to be targeted, refined and focused in order to produce actionable results.