Key elements to build an effective data virtualization architecture

By Ravi Shankar, SVP and CMO, Denodo

Is your organization prepared to implement a data virtualization architecture? A successful implementation often requires coordination between several technical pieces for capturing data, organizing it and ensuring data quality and governance.

The ultimate goal of a data virtualization architecture, which can provide integrated views of data from different source systems, is to enable users and applications to access data without having to understand the intricacies of the underlying data technology.

Sometimes that’s easier said than done, but luckily there is a set of specific components that can help ensure effective implementation and management. These include a good abstraction tier to hide some of the underlying complexity, a metadata management layer to orchestrate important data virtualization processes and data quality capabilities to identify problems and clean data. It’s also important to work out some of the governance and security issues around the underlying data and how it’s shared.

Abstraction tier:

A data virtualization architecture requires a layer of technology that acts as an abstraction layer between the user and the one or more data stacks needed in the framework. In the data analytics space, the abstraction layer usually comes in the form of the end-user tools themselves. Many analytic tool sets offer the user the ability to explore the data without needing to write queries — or even know how the underlying data technology works. Using data models, the complexity of the underlying data structure can be significantly hidden in a way that exposes only the schematic model to end users. 

This involves constructing virtual schematic models of the data. A robust analytic platform will allow such models to be written against multiple types of data stacks in a consistent manner. Look for tools that have the flexibility to handle all underlying data structures, technologies and stack idiosyncrasies.
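To make the idea concrete, here is a minimal Python sketch of an abstraction tier, assuming two hypothetical sources (a relational table and a document feed) and a made-up virtual schema of customer_id and customer_name. Consumers call one function and never see the SQL or the document field names.

import sqlite3
from typing import Iterator

def rows_from_sql(conn: sqlite3.Connection) -> Iterator[dict]:
    # Relational source: expose rows as plain dicts so callers never see the SQL.
    for cust_id, name in conn.execute("SELECT id, name FROM customers"):
        yield {"customer_id": cust_id, "customer_name": name}

def rows_from_documents(docs: list) -> Iterator[dict]:
    # Document source: remap its field names onto the same virtual schema.
    for doc in docs:
        yield {"customer_id": doc["custId"], "customer_name": doc["fullName"]}

def virtual_customer_view(conn, docs) -> list:
    # The only thing consumers see: one schema, regardless of where rows came from.
    return list(rows_from_sql(conn)) + list(rows_from_documents(docs))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Acme Corp')")
docs = [{"custId": 2, "fullName": "Globex Inc"}]
print(virtual_customer_view(conn, docs))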

Dynamic data catalog:

In order to surface data across the enterprise, it’s important to provide business-friendly data exploration and preparation capabilities using the features of a traditional data catalog.

This includes classification and tagging, data lineage and descriptions. It’s also useful to enable keyword-based search and discovery. This may also require a translation tier for mapping data labels to terms businesspeople may be more familiar with.
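As an illustration, a minimal Python sketch of a catalog entry follows, with hypothetical field names. It pairs a technical label with a business-friendly term, keeps tags and lineage, and supports a simple keyword search across all of them.

from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    technical_name: str                            # physical column or view name
    business_term: str                             # label business users search for
    description: str
    tags: list = field(default_factory=list)
    lineage: list = field(default_factory=list)    # upstream sources

    def matches(self, keyword: str) -> bool:
        haystack = " ".join([self.technical_name, self.business_term,
                             self.description, *self.tags, *self.lineage])
        return keyword.lower() in haystack.lower()

catalog = [
    CatalogEntry("cust_acct_bal", "Customer Account Balance",
                 "End-of-day balance per customer account",
                 tags=["finance", "pii"], lineage=["core_banking.accounts"]),
]
print([e.business_term for e in catalog if e.matches("balance")])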

Model governance:

The virtual data model or schema is the central element in data virtualization architecture, especially in relation to analytics. It then follows that the governance of the model definitions is central to successful virtualization. The governance of the model should include its inputs, the data in the data stack, and its outputs, analytic artifacts like reports, calculations and machine learning logic.

The model and its definition need to be fully governed through practices like proper documentation, solid security measures and definition versioning. Sanctioned watermarking to denote the quality of a published model is an example of good governance.
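A minimal sketch of what that could look like follows, using hypothetical names and a simple in-memory registry: each published version is immutable and documented, and carries a sanctioned flag that acts as the watermark.

from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: int
    definition_sql: str        # the virtual view definition
    documentation: str
    sanctioned: bool           # the governance watermark
    published_on: date

registry = {}

def publish(model: ModelVersion) -> None:
    key = (model.name, model.version)
    if key in registry:
        raise ValueError("Versions are immutable; publish a new version instead")
    registry[key] = model

publish(ModelVersion("customer_360", 1,
                     "SELECT id, name, balance FROM crm_customers JOIN billing USING (id)",
                     "Unified customer view for analytics",
                     sanctioned=True, published_on=date(2024, 1, 15)))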

Metadata and semantics mediation:

The most important component of a data virtualization architecture is the metadata layer that captures syntax and semantics of source schemas. This component will need to dynamically observe schema changes and either merge or escalate differences over time. A virtual view that exposes a unified schema for analytics is also important to hide implementation differences between sources. Some may be accessed in real time, while others will need to work off of snapshots.

Often, the meaning of specific fields and codes is not described in the data itself. In order to reuse the data, metadata, including schemas, glossaries of terms, and governance and security rules, must be documented and retained.
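A minimal sketch of one way to mediate schema drift, assuming simple column-name-to-type mappings: new columns are merged into the virtual schema automatically, while type changes on existing columns are escalated for review.

def reconcile(virtual_schema: dict, source_schema: dict):
    merged = dict(virtual_schema)
    escalations = []
    for column, col_type in source_schema.items():
        if column not in merged:
            merged[column] = col_type            # additive change: safe to merge
        elif merged[column] != col_type:
            escalations.append((column, merged[column], col_type))  # conflict
    return merged, escalations

virtual = {"customer_id": "INTEGER", "customer_name": "TEXT"}
source = {"customer_id": "TEXT", "customer_name": "TEXT", "region": "TEXT"}
merged, conflicts = reconcile(virtual, source)
print(merged)      # region merged in
print(conflicts)   # customer_id type drift escalated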

Security and governance controls:

Secure sharing requires that the data virtualization architecture maintains and enforces the rules and authorities around data access, as well as retains auditable records of where data came from and where it was used. When multiple systems are being integrated, the individual source systems cannot enforce security over the data on their own; the data virtualization tier must control this access.

This isn’t only a technical capability. Some systems automatically protect data and preserve the lineage information, allowing for simpler governance with less risk than systems requiring extensive configuration for every combination of usage.
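For illustration, here is a minimal sketch of rule enforcement and auditing at the virtualization tier, with hypothetical roles and view names; every access decision is also written to an audit trail.

from datetime import datetime, timezone

ACCESS_RULES = {"customer_360": {"analyst", "data_steward"}}
audit_log = []

def check_access(user: str, role: str, view: str) -> bool:
    allowed = role in ACCESS_RULES.get(view, set())
    audit_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user, "role": role, "view": view, "allowed": allowed,
    })
    return allowed

print(check_access("alice", "analyst", "customer_360"))   # True, logged
print(check_access("bob", "intern", "customer_360"))      # False, logged
print(audit_log)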

Clear governance rules:

In enterprises, “it’s too hard to extract data from the legacy system” is sometimes code for “I’m not prepared to share this data.” While removing technical barriers to sharing, it’s also important to remove the organizational barriers. Clear governance rules can help so that each group doesn’t feel compelled to establish its own policies ad hoc.

Visibility into downstream usage can also help with trust across groups. Tracking lineage and provenance so that data owners can see where their data is being used can help surface organizational issues more rapidly.
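A minimal sketch of that kind of usage tracking, with hypothetical dataset and consumer names: every read is recorded against the dataset so its owning team can list downstream consumers.

from collections import defaultdict

usage_by_dataset = defaultdict(list)

def record_usage(dataset: str, consumer: str) -> None:
    # Called whenever a report, model or application reads a dataset.
    usage_by_dataset[dataset].append(consumer)

record_usage("billing.invoices", "finance_quarterly_report")
record_usage("billing.invoices", "churn_prediction_model")
print(usage_by_dataset["billing.invoices"])  # what the billing team sees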

Data quality:

In order to ensure that the data delivered by the enterprise data layer is correct, a data virtualization architecture should include data validations and on-the-fly transformations in an agile and flexible way.

Data quality capabilities will ensure that data validations are applied to any application connecting to the enterprise data layer. Filtering incorrect rows and values after applying data validation logic ensures that data consumers receive only rows with correct data. Flagging incorrect values is done by adding extra columns with a flag indicating that particular values are wrong. Restorative functions can also replace an originally incorrect value through transformation logic.
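These three tactics can be sketched in a few lines of Python, using hypothetical validation rules for email and age fields: flag suspect values in an extra column, restore impossible values through transformation logic and filter rows that fail validation.

rows = [
    {"customer_id": 1, "email": "a@example.com", "age": 34},
    {"customer_id": 2, "email": "not-an-email", "age": -5},
]

def is_valid_email(value: str) -> bool:
    return "@" in value and "." in value.split("@")[-1]

cleaned = []
for row in rows:
    out = dict(row)
    out["email_flag"] = "OK" if is_valid_email(row["email"]) else "INVALID"  # flag
    if row["age"] < 0:
        out["age"] = None            # restore: replace an impossible value
    cleaned.append(out)

filtered = [r for r in cleaned if r["email_flag"] == "OK"]   # filter bad rows
print(cleaned)
print(filtered)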

Consider graph databases:

Graph database vendors are making solid inroads with their enterprise clients on data virtualization projects. Products like Neo4j can be installed on top of or alongside traditional disparate data sources and used to ingest cherry-picked data from the traditional systems.

A data model for the graph is defined by attaching attributes such as tags, labels and even directional connections to other bits of incoming data sourced from completely different locations. These added flourishes allow advanced machine learning- and AI-based applications to more easily work with the different data sets to perform predictions and glean other insights.
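As a sketch, the snippet below loads a couple of cherry-picked records into Neo4j using the official Python driver (the neo4j package, version 5.x); the connection details, node labels and relationship type are hypothetical.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def link_customer_to_order(tx, customer_name, order_id):
    # Attach labels and a directed relationship to data pulled from two
    # different source systems (for example, CRM and order management).
    tx.run(
        "MERGE (c:Customer {name: $name}) "
        "MERGE (o:Order {id: $order_id}) "
        "MERGE (c)-[:PLACED]->(o)",
        name=customer_name, order_id=order_id,
    )

with driver.session() as session:
    session.execute_write(link_customer_to_order, "Acme Corp", "SO-1001")
driver.close()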

Having a standing team that serves as the organization’s main enforcement body for rules and standards around data cleanliness, data identification and any other issues related to the body of a company’s data is critical.

Without having these guidelines and stakeholders defined, most data virtualization projects at scale will be extremely hard, if not impossible, to pull off successfully.
