Dataset Management for Hadoop

Loom is an enterprise information management system for the Hadoop ecosystem.

Data Lineage and Auditing

Loom keeps track of all activity within Hadoop.

  • Automatically captures detailed auditing information about all Hadoop activity
  • Timestamps every change to a Loom dataset and records the user who initiated it

Provenance

Loom makes sure you understand where your data came from.

  • Enables extensible descriptions of all assets
  • Lets you describe the relationships among datasets and other assets, making it easier to find relevant data for data science projects
  • Tracks data owners, data stewards, and the system of record

Productivity

Loom enables data scientists and IT to build more analytics faster.

  • Easy-to-use interfaces
  • Find the right data for the right job quickly
  • Supports existing methodologies
  • Transform data using multiple standard Hadoop tools

Business Unit Views of Enterprise Information

Loom provides an enterprise-level metadata model that can be extended by individual business units.

  • Business unit extensions roll up easily to the enterprise view

Low Cost

Loom is built on open-source Hadoop and R.

  • Runs on commodity hardware
  • Data storage and processing using Hadoop and other open-source tools

Analytic Activity Monitoring

Loom provides out-of-the-box reporting capabilities for all analytic activities.

  • Micro-reports:
    • On the page for a dataset, view the Jobs associated with the dataset
      • One section for jobs where the dataset is an input
      • One section for jobs where the dataset is an output
    • On the page for a query, view the Jobs associated with the query
    • On the page for a dataset, view the Datasets that the dataset was derived from
      • One section for datasets that are directly (1-step) upstream
      • One section for all datasets upstream
    • On the page for a dataset, view the Datasets which are derived from it
      • One section for datasets that are directly (1-step) downstream
      • One section for all datasets downstream (derived from this one)
  • Navigation
    • Whenever there is a relationship to another entity in the registry, that relationship should be navigable (via hyperlink)
  • Macro-reports (reports can be scoped in many ways):
    • Dataset-Transformation (currently Dataset-Query)
      • Datasets used by transformations (queries)
      • Transformations (queries) that use datasets
    • Job-Query-Dataset
      • Transformations (queries) that were run in Jobs
      • Datasets that were used in Jobs
      • Jobs that use transformations (queries)
      • Jobs that use datasets
    • Dataset-Dataset
      • Datasets that produced other datasets
      • Datasets that were transformed from other datasets
    • Dataset source-sink
      • Datasets imported into the system ('source' level)
      • Datasets exported from the system ('sink' level)
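
The dataset and job reports listed above amount to queries over a lineage graph. The following is a minimal sketch of that idea, not Loom's actual API or data model: the job records, dataset names, and the helper functions build_edges, reachable, jobs_for, and sources_and_sinks are all hypothetical. It assumes each job records the datasets it reads and writes, and it approximates the 'source' and 'sink' levels as datasets with no recorded upstream or downstream.

```python
from collections import defaultdict

# Hypothetical job records: each job runs one query, reads inputs, writes outputs.
JOBS = [
    {"job": "job-1", "query": "clean_orders",
     "inputs": ["raw_orders"], "outputs": ["orders"]},
    {"job": "job-2", "query": "join_customers",
     "inputs": ["orders", "customers"], "outputs": ["orders_enriched"]},
    {"job": "job-3", "query": "daily_rollup",
     "inputs": ["orders_enriched"], "outputs": ["daily_sales"]},
]


def build_edges(jobs):
    """Return (parents, children): the 1-step upstream and downstream maps."""
    parents = defaultdict(set)
    children = defaultdict(set)
    for job in jobs:
        for src in job["inputs"]:
            for dst in job["outputs"]:
                parents[dst].add(src)
                children[src].add(dst)
    return parents, children


def reachable(start, edges):
    """All datasets reachable from start by following edges (any number of steps)."""
    seen = set()
    stack = list(edges.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, ()))
    return seen


def jobs_for(dataset, jobs, role):
    """Jobs where the dataset appears in the given role ('inputs' or 'outputs')."""
    return [job["job"] for job in jobs if dataset in job[role]]


def sources_and_sinks(jobs, parents, children):
    """Assumption: a source has no recorded upstream, a sink no recorded downstream."""
    datasets = {d for job in jobs for d in job["inputs"] + job["outputs"]}
    sources = sorted(d for d in datasets if not parents.get(d))
    sinks = sorted(d for d in datasets if not children.get(d))
    return sources, sinks


parents, children = build_edges(JOBS)
direct_upstream = parents["orders_enriched"]        # 1-step upstream
all_upstream = reachable("daily_sales", parents)    # every upstream dataset
all_downstream = reachable("raw_orders", children)  # every downstream dataset
sources, sinks = sources_and_sinks(JOBS, parents, children)
```

In this sketch the micro-reports for a dataset page correspond to jobs_for (jobs where the dataset is an input or an output) and to the parents, children, and reachable lookups, while the source-sink macro-report corresponds to sources_and_sinks.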