Unscrambling the data egg

Everybody knows that it is not possible to unscramble a scrambled egg. But when it comes to the veracity of data presented in on-screen forms, views, reports, or extracts, it is essential to be able to unscramble that data in order to make sense of it and have confidence in it.

Disassembling and Unscrambling

The most common approach to determining data lineage traces data flows in reverse to identify individual data mappings. This method starts at the target data end point and disassembles data that have been combined, transformed, or aggregated. To unscramble the data, it then works backwards from the target through the processes, applications, and ELT/ETL tasks for each of the disassembled data components. The process is repeated until the data origins are established for each segment of the flow.
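As a rough illustration of this reverse tracing, the minimal Python sketch below walks backwards from a target field until it reaches fields that have no upstream source. The field names and the MAPPINGS dictionary are invented for the example; in practice each mapping would have to be recovered from the ETL/ELT jobs and applications themselves.

```python
from collections import deque

# Hypothetical mappings: each target field lists the source fields that feed it.
MAPPINGS = {
    "report.total_revenue": ["warehouse.order_amount", "warehouse.fx_rate"],
    "warehouse.order_amount": ["staging.orders.amount"],
    "warehouse.fx_rate": ["staging.rates.rate"],
    "staging.orders.amount": ["crm.orders.amount"],
}


def trace_origins(target, mappings):
    """Walk the mappings backwards from target and return the fields
    that have no upstream source, i.e. the data origins."""
    origins = set()
    seen = {target}
    queue = deque([target])
    while queue:
        field = queue.popleft()
        sources = mappings.get(field, [])
        if not sources:
            # Nothing feeds this field, so treat it as an origin.
            origins.add(field)
            continue
        for source in sources:
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return origins


print(sorted(trace_origins("report.total_revenue", MAPPINGS)))
```

Note that the sketch only visits the pipelines that feed the chosen target, which is exactly the limitation of this approach.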

Whilst this approach produces a collection of metadata lineage maps, it focuses only on those metadata pipelines that are involved in populating the target system under scrutiny. It does not necessarily provide a comprehensive view of ALL information flows and how they interact.

In addition, whilst a manual process for determining lineage may be acceptable as a one-off exercise, its results are unlikely to stay current and aligned with changes to the environment or metadata sources. For the sake of speed and efficiency, it is preferable to automate the process and examine the environment on a regular basis for any changes to the metadata sources.
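One simple way to picture such a regular check is to compare two snapshots of harvested metadata and report what has changed. The sketch below assumes, purely for illustration, that each snapshot is a dictionary of column names and data types; the names and types are made up.

```python
def diff_snapshots(previous, current):
    """Compare two metadata snapshots (column name -> data type)
    and report columns that were added, removed, or changed."""
    prev_keys, curr_keys = set(previous), set(current)
    return {
        "added": sorted(curr_keys - prev_keys),
        "removed": sorted(prev_keys - curr_keys),
        "changed": sorted(
            name for name in prev_keys & curr_keys
            if previous[name] != current[name]
        ),
    }


# Hypothetical snapshots taken on two consecutive scans.
last_week = {"customer.id": "INT", "customer.name": "VARCHAR(50)", "customer.fax": "VARCHAR(20)"}
today = {"customer.id": "BIGINT", "customer.name": "VARCHAR(50)", "customer.email": "VARCHAR(100)"}

print(diff_snapshots(last_week, today))
```

Any non-empty result would be a prompt to re-examine the lineage that touches those columns.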

It should also be noted that this process produces a technical view of the information flow, but it does not necessarily provide semantic lineage, i.e. how the metadata assets map to the vocabulary of the business.

Metadata and Mapping Catalog

With these matters in mind, an alternative approach to establishing metadata lineage combines:

  • A business glossary that provides the semantic meaning and link to metadata
  • A metadata catalog that holds the inventory of metadata assets
  • A metadata mapping framework that records connections between the metadata assets

The business glossary and metadata catalog define and disambiguate metadata so that metadata definitions and business terminology have a clear meaning. This is essential for a true and correct understanding of information, and it provides a single, meaningful source of definition for the metadata.

A metadata catalog approach lends itself to automation: platform-specific metadata connectors can scan the environment and systems to harvest and catalog metadata from the specified assets.

Similarly, further automation can parse the code involved in the movement of data, such as ELT/ETL jobs, triggers, and procedural code (essentially any executable code that moves data from a source to a target data point), and reverse engineer it to determine the mappings between source and target metadata sets. In cases where the mappings cannot be determined automatically, a tool can help to create the mappings manually in the catalog. The result is a mapping and metadata catalog that incorporates the structural and semantic metadata associated with each data asset, as well as the direct mappings that show how that data set is populated.
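To give a flavour of what reverse engineering mappings from code can look like, here is a deliberately simplified Python sketch that pulls column-level mappings out of a plain INSERT ... SELECT statement with a regular expression. It is an assumption-laden toy: the table and column names are invented, it only handles one source table and simple column lists, and real ELT/ETL code would need a proper parser. Statements it cannot resolve are skipped, which corresponds to the cases that are left for manual mapping.

```python
import re

# Matches: INSERT INTO <target> (<columns>) SELECT <columns> FROM <source>
INSERT_SELECT = re.compile(
    r"INSERT\s+INTO\s+(?P<target>[\w.]+)\s*\((?P<tcols>[^)]*)\)\s*"
    r"SELECT\s+(?P<scols>.*?)\s+FROM\s+(?P<source>[\w.]+)",
    re.IGNORECASE | re.DOTALL,
)


def extract_mappings(sql):
    """Return (source_column, target_column) pairs found in simple
    INSERT ... SELECT statements. Anything more complex is skipped."""
    mappings = []
    for match in INSERT_SELECT.finditer(sql):
        target_cols = [c.strip() for c in match.group("tcols").split(",")]
        source_cols = [c.strip() for c in match.group("scols").split(",")]
        if len(target_cols) != len(source_cols):
            continue  # cannot pair columns positionally; leave for manual mapping
        for tcol, scol in zip(target_cols, source_cols):
            mappings.append(
                (f"{match.group('source')}.{scol}", f"{match.group('target')}.{tcol}")
            )
    return mappings


# Hypothetical load statement.
sql = "INSERT INTO dw.customer (id, full_name) SELECT cust_id, name FROM staging.customer;"
for source, target in extract_mappings(sql):
    print(source, "->", target)
```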

A combined mapping and metadata catalog is powerful because it allows users to determine lineage dynamically from the documented mappings. All points of the metadata lineage can be determined on demand: the source and target data points, the sequences of processing stages, and the metadata transformations.
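A minimal sketch of this idea, assuming the catalog stores each mapping as a simple record of source, target, process, and transformation, might look like the following. The record layout, the CATALOG contents, and the function names are hypothetical and chosen only for illustration; the sketch also assumes the mappings contain no cycles.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    source: str
    target: str
    process: str
    transformation: str


# Hypothetical catalog entries.
CATALOG = [
    Mapping("crm.customer.name", "staging.customer.name", "nightly_extract", "copy"),
    Mapping("staging.customer.name", "dw.customer.full_name", "load_customer", "trim and upper-case"),
    Mapping("dw.customer.full_name", "report.customer_list.name", "build_report", "copy"),
]


def upstream(target, catalog):
    """Return the chains of mappings that lead to target,
    each ordered from origin to target (assumes no cycles)."""
    incoming = [m for m in catalog if m.target == target]
    if not incoming:
        return [[]]  # target is an origin: one empty chain
    chains = []
    for mapping in incoming:
        for chain in upstream(mapping.source, catalog):
            chains.append(chain + [mapping])
    return chains


def downstream(source, catalog):
    """Return every data point that is populated, directly or
    indirectly, from source."""
    reached, stack = set(), [source]
    while stack:
        node = stack.pop()
        for mapping in catalog:
            if mapping.source == node and mapping.target not in reached:
                reached.add(mapping.target)
                stack.append(mapping.target)
    return reached


for chain in upstream("report.customer_list.name", CATALOG):
    for step in chain:
        print(f"{step.source} -> {step.target} [{step.process}: {step.transformation}]")
```

Because the lineage is computed from the stored mappings at query time, the same catalog answers both "where did this come from?" and "what does this feed?" without maintaining a separate lineage map for each target.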

Watch our webinar showing how data mapping and data intelligence can help your organisation's data landscape here.

Blog inspired by David Loshin – Top 5 Data Catalog Benefits: Understanding Your Organization’s Data Lineage