Lineage

Visibility, Control, and Trust in Hadoop

Loom tracks lineage of all Hadoop transformations and datasets

Key Features

  • Dataset munging - character replace, white-space removal, de-duplication, datatype transformation...
  • SQL operations - joins, filters, aggregates, UDFs...
  • Integration with third-party tools
  • Loom tracks all lineage and metadata

 

Data in Hadoop almost never originates in Hadoop. It is created by a business application, web site, or machine (sensor) and then loaded into Hadoop, often from many different source systems. After the data is loaded, new data is derived through numerous transformations of the original files. In fact, the data generated within Hadoop from the original files can be much larger and more complex than the original files themselves. The chains of transformations can grow so long that it is not feasible to track the origin of the data manually; lineage must be formally calculated as the transformations occur.

Users of data in Hadoop need to be able to track the provenance of the data all the way back to the system that originated it. Otherwise, they have no idea what data they are actually using and can fall prey to “junk in, junk out.”

 

Provenance - information about a set of data that describes the data in enough detail for any user to know whether it is the right data for their purpose

 

Lineage - a formal calculation of the transformations made to sets of data by computers. Lineage is a subset of provenance.

 

Loom provides robust features to capture both the provenance and lineage of data in Hadoop. Tracking the lineage of data through the transformation processes in Hadoop is complex. In many cases multiple files feed into a transformation and multiple files result from it: multiple inputs, multiple outputs. Loom calculates a lineage graph for all transformation operations and can then analyze that graph to produce graphical or tabular analytics, so that users can easily determine the lineage of the data in any output set they are considering for their use case.
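To make the idea of a lineage graph concrete, here is a minimal sketch in Python. It is not Loom's API or implementation; the `LineageGraph` class, its method names, and the dataset names are all hypothetical, chosen only to illustrate how multi-input, multi-output transformations could be recorded as they occur and then walked back to the original source files.

```python
from collections import defaultdict


class LineageGraph:
    """Hypothetical in-memory lineage graph (illustration only, not Loom's API)."""

    def __init__(self):
        # Maps each dataset to the (transformation, inputs) pairs that produced it.
        self._producers = defaultdict(list)

    def record(self, name, inputs, outputs):
        """Record one transformation with any number of inputs and outputs."""
        for out in outputs:
            self._producers[out].append((name, tuple(inputs)))

    def upstream(self, dataset):
        """Return every dataset that the given dataset was derived from."""
        seen = set()
        stack = [dataset]
        while stack:
            current = stack.pop()
            for _, inputs in self._producers.get(current, []):
                for src in inputs:
                    if src not in seen:
                        seen.add(src)
                        stack.append(src)
        return seen

    def sources(self, dataset):
        """Return upstream datasets that no recorded transformation produced,
        i.e. the files that were loaded from outside."""
        return {d for d in self.upstream(dataset) if d not in self._producers}


# Example with made-up dataset names.
graph = LineageGraph()
graph.record("clean", ["crm_export.csv"], ["customers_clean"])
graph.record("join", ["customers_clean", "web_logs"], ["sessions", "rejects"])
graph.record("aggregate", ["sessions"], ["daily_report"])

print(sorted(graph.sources("daily_report")))
```

In this sketch, asking for the sources of `daily_report` walks back through the aggregate, join, and clean steps to the two files that were loaded from outside. Recording each transformation at the moment it runs is what makes this traversal possible without manual bookkeeping.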

In some cases this level of “auditing” is required for governance or regulatory reasons, not just for usability. Loom’s formal solution to lineage calculation meets these regulatory and governance requirements.