Can the schema-on-read approach lead to data blind spots?

By Rashi Varshney On Jun 28, 2016

By Rajesh Kamath

The buzz around big data technologies is forcing financial services (and indeed, other) organizations to take a fresh look at their data ‘pipelines’ – the mechanisms they use to store data and ferry it between its sources and its consumers. Arguably, the process undergoing the most change due to these technologies is the extraction / transformation / loading (ETL) of data.

A much needed (data) journey

ETL processes help move data that has been ‘discovered’ in an organization from its existing locations to the places where it can be readied for use in consumer-facing processes (e.g. data warehouses, data marts, operational data stores). In addition to being transported between applications, the data can also be improved, enriched and transformed along the journey for more efficient use.

The promise of speed

The elements of the ETL process have been moved around by big data technology vendors – the current avatar is the ELT process, driven by the ‘schema-on-read’ philosophy. This philosophy focuses on getting data from the source systems into the big data infrastructure in its as-is form, without bothering much about its structure (‘schema’). (E)xtraction and (L)oading of data gain prominence. (T)ransformation of data is carried out later, as and when the need arises. There is no doubt that this philosophy, supported by technology, has helped speed up the technical implementation of data acquisition. On the other hand, I also wonder if this increased efficiency in certain parts of the data pipeline can lead to a blind spot in its other parts. Let me try and explain with a real-life case.

It was an incredible drive…

A leading wealth management company approached us to assess their data warehousing + BI initiative which had run into a set of problems, and recommend remedial measures.
Their technology ecosystem had grown through acquisitions – multiple different applications in use, performing the same or slightly different functions. Rather than wait for the rationalization of their ecosystem in the short term, they had decided to build a data warehouse and a BI platform for their internal and external consumers to access the available data and quickly derive value from it.

They had chosen a big data technology product to build a data warehouse, which would pull data in from multiple applications to deliver BI and analytics to their end users.
In fact, they had adopted an efficient parallel implementation strategy – once the reporting & analysis business requirements had been gathered, they initiated parallel tracks – one for building the BI components, another for data acquisition (schema-on-read style).

Data acquisition was surprisingly fast – in fact, much faster than traditional mechanisms. In some cases, they even went ahead and acquired most data from a source because it was so easy. Transformation of data for use in reporting was not very onerous either because of certain technology choices. On the other side, their reports and dashboards were being rapidly developed in parallel using an off-the-shelf BI product.

The first set of dashboards started getting data in about 5 months from initiation. If you think about it, that is an incredible time to market for a data warehousing program!

…till we hit the data blind spot

However, the business teams started reporting problems – the dashboards / reports were working mostly as expected in terms of functionality, but there were problems with the results. Some of these problems were serious.

For example, when they evaluated data from their source systems as a part of their data investigations, they found 200+ account types. Even after accounting for duplicate values across the systems, there were 100 ‘unique’ account types that had to be contended with. (For reference, anything more than 40 unique basic account classifications should set the data quality radars buzzing.) In other words, a large data rationalization problem.
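A quick profiling pass over the acquired data is usually enough to surface this kind of problem early. The sketch below is purely illustrative: the source system names, the sample values and the `normalize` helper are my assumptions, not the client's actual data or tooling.

```python
from collections import defaultdict

# Hypothetical account-type values as they might arrive from three source systems.
SOURCE_ACCOUNT_TYPES = {
    "crm": ["IRA", "Roth IRA", "Joint", "Trust"],
    "custody": ["ira", "ROTH-IRA", "JT TEN", "Trust "],
    "advisory": ["Ind. Retirement Acct", "Roth IRA", "Joint", "401(k)"],
}

def normalize(value: str) -> str:
    """Cheap mechanical cleanup: trim, lowercase, treat hyphens as spaces."""
    return " ".join(value.strip().lower().replace("-", " ").split())

def profile(sources):
    """Return the value count, the distinct as-is count and the normalized groups."""
    raw = [v for values in sources.values() for v in values]
    by_normalized = defaultdict(set)
    for v in raw:
        by_normalized[normalize(v)].add(v)
    return len(raw), len(set(raw)), dict(by_normalized)

total, distinct_raw, groups = profile(SOURCE_ACCOUNT_TYPES)
print(f"{total} values, {distinct_raw} distinct as-is, {len(groups)} after normalization")
```

Note that mechanical cleanup only goes so far. In this made-up sample, ‘JT TEN’ and ‘Ind. Retirement Acct’ still sit in their own groups, and only someone who knows the business can say that they duplicate ‘Joint’ and ‘IRA’.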

Their Strategy team asked them to support analysis related to the DOL ERISA Fiduciary rule. You can imagine how difficult this was with the way account type information was being made available in the dashboards. Imagine selecting from a list of 150 ‘account types’ in a screen to run a report or see a chart.
There were more such examples. Feedback from the business strongly suggested that the experience of consuming data via the platform was far from optimal, both in terms of data accuracy and in terms of the overall data-related user experience.
In other words, the program had hit a data blind spot!

Did someone move our columns?

In the traditional ETL (‘schema-on-write’) model, we have to design for data to come into the warehouse once the requirements have been analyzed. This design phase involves inventorying the data in the various sources and building the data schema(s) in the warehouse. This phase typically requires a lot of diligence, especially in complex programs.

In addition to schema definition, this phase forces the organization to acknowledge data related issues (data quality, reference & master data management, data harmonization across sources) head-on, investigate and remediate them. This data remediation involves substantial collaboration between business and IT SMEs.

This activity is one of the (if not the) largest contributors to any data warehouse implementation program plan in terms of schedule.

We were hypnotized by that yellow elephant

The schema-on-read approach has brought about a change in the process.
This change may seem subtle. However, in our observations, this change plays an important part in introducing a data blind spot in the program.

Because data acquisition into the warehouse has been made so easy, it often precedes data design activities in implementations. Because of the promised cost effectiveness of data storage in Hadoop (as opposed to traditional data warehouses), there is also a great temptation to bring all data from sources into the big data platform.

Because data is now acquired and ‘available’, it is very tempting to start transforming it for end-use. And this transformation often proceeds without any data design and data remediation activities.
While the ‘acquire and transform’ approach does have certain benefits when executed well, it is also likely to convey an artificial sense of progress in the program. I call it artificial progress because it has been achieved by ignoring / bypassing some very critical activities in the data pipeline – i.e., investigation and remediation of data problems.

In other words, the schema-on-read / ELT approach does not make the data related problems go away. They just pop up downstream in the pipeline, most likely when the data is being consumed.
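To make this concrete, here is a minimal sketch of how schema-on-read defers problems rather than removing them. The record layout, the field names and the `read_with_schema` helper are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical raw records landed as-is (the E and L of ELT). Nothing is validated on load.
RAW_LANDING_ZONE = [
    '{"account_id": "A1", "account_type": "IRA", "balance": "1200.50"}',
    '{"account_id": "A2", "acct_type": "Roth IRA", "balance": "n/a"}',
    '{"account_id": "A3", "account_type": "ira", "balance": "300"}',
]

def read_with_schema(line: str) -> dict:
    """Apply the schema at read time; this is where problems finally surface."""
    record = json.loads(line)
    return {
        "account_id": record["account_id"],
        "account_type": record["account_type"],  # KeyError if a source named the field differently
        "balance": float(record["balance"]),     # ValueError if the value is not numeric
    }

def read_all(lines):
    """Split the landed data into readable records and rejects with the error name."""
    good, rejected = [], []
    for line in lines:
        try:
            good.append(read_with_schema(line))
        except (KeyError, ValueError) as exc:
            rejected.append((line, type(exc).__name__))
    return good, rejected

good, rejected = read_all(RAW_LANDING_ZONE)
print(len(good), "readable,", len(rejected), "rejected at read time")
```

All three records load without complaint. One fails only when a consumer reads it, and the two that do pass still carry ‘IRA’ and ‘ira’ as separate account types, a problem that no exception will ever flag.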

Planning better for the next journey

Even after adopting a schema-on-read / ELT approach, we have observed that the true critical path of the warehouse implementation continues to include data analysis, design and remediation activities.
These activities cannot be bypassed, whether in the ETL-based approach or in the ELT-based one. If data design and remediation are not conducted before the data is made available for consumption, the data-related issues are likely to manifest themselves in reports, dashboards and other business-facing experiences. In other words, the order of these activities in the critical path can be changed (often, involuntarily) but they cannot be eliminated from the critical path.

As a part of our assessment, we have recommended that the company continue with the existing ELT approach to their data pipeline, but incorporate a split data design phase in its process.
Once the data has been acquired, it will undergo data design, investigation and remediation for multiple aspects (Design I + Remediate) – data quality issues, metadata matching across multiple sources to a common canonical data model, data value harmonization across sources – for all planned and anticipated dashboards and reports. These activities will have to be brought into the equation every time new requirements are received from data consumers.
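The value harmonization part of this first phase can be sketched roughly as follows. The canonical names and the mapping table are hypothetical; in practice they would come out of the collaboration between business and IT SMEs.

```python
# Hypothetical canonical account types, as business and IT SMEs might agree them.
CANONICAL_MAP = {
    "ira": "Traditional IRA",
    "ind. retirement acct": "Traditional IRA",
    "roth ira": "Roth IRA",
    "joint": "Joint",
    "jt ten": "Joint",
}

def harmonize(records, mapping):
    """Attach a canonical account type and collect the values nobody has mapped yet."""
    harmonized, unmapped = [], set()
    for rec in records:
        key = rec["account_type"].strip().lower().replace("-", " ")
        canonical = mapping.get(key)
        if canonical is None:
            unmapped.add(rec["account_type"])
        harmonized.append({**rec, "canonical_account_type": canonical})
    return harmonized, unmapped

records = [
    {"source": "crm", "account_type": "IRA"},
    {"source": "custody", "account_type": "ROTH-IRA"},
    {"source": "custody", "account_type": "JT TEN"},
    {"source": "advisory", "account_type": "529 Plan"},
]
harmonized, unmapped = harmonize(records, CANONICAL_MAP)
print("still to be reviewed:", sorted(unmapped))
```

The unmapped set is the remediation backlog: each value in it needs a decision from someone who understands the business before it reaches a dashboard.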

After the data has been remediated for quality, it can then enter a use-case specific design (Design II) and transformation process. In this process, the data is ‘designed’ for the specific requirements of each report or dashboard, and then transformed accordingly within their big data platform.
However, the availability of business and IT SME expertise in data remediation continues to be a necessary condition for success.
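Once the data carries canonical values, the use-case specific step becomes comparatively simple. The following is a sketch for one hypothetical dashboard, assuming remediated records that already have a `canonical_account_type` field (a name I have made up for illustration).

```python
from collections import defaultdict

# Hypothetical remediated records, i.e. the output of the first design phase.
REMEDIATED = [
    {"canonical_account_type": "Traditional IRA", "balance": 1200.0},
    {"canonical_account_type": "Roth IRA", "balance": 800.0},
    {"canonical_account_type": "Traditional IRA", "balance": 300.0},
    {"canonical_account_type": "Joint", "balance": 500.0},
]

def assets_by_account_type(records):
    """Shape the data for one specific report: total balance per canonical type."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["canonical_account_type"]] += rec["balance"]
    return dict(sorted(totals.items()))

report = assets_by_account_type(REMEDIATED)
print(report)
```

The report picker now offers three account types. The same aggregation over unremediated values would have offered one entry per spelling.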

With this approach, the organization will continue to benefit from the efficiencies in data acquisition that the ELT philosophy makes available. In addition to increased efficiency in acquisition itself, this approach will also provide the investigation & remediation team a broader view of the data across multiple sources. This, in turn, should bring additional efficiencies in the remediation process too.
In other words, a strategy to help them eliminate their data blind spot!

The author is Vice President – Financial Services Solution and Incubation, Incedo Inc.
