Hassle-free Data in Hadoop
Loom Datasets provide simple, uniform interaction with Hadoop data
- Loom Datasets have known schemas
- Loom Datasets have known formats
- Loom Datasets are defined automatically by Activescan
- Loom Datasets are actionable - transformation and analysis can be done through Hive and Loom
Hadoop's flexibility with data formats and types can make it very difficult for data scientists and data engineers to understand and work with the data they need. These users waste significant time finding and deciphering data in Hadoop, and that time is costly.
Activescan dynamically crawls HDFS and registers new potential sources with schemas and statistics. Users can accept these sources as-is or make manual changes to the metadata generated by Activescan. Once accepted, a source changes status from "potential" to "active", at which point it can be converted to a Loom Dataset.
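The source lifecycle described above (crawl, register, optionally correct metadata, accept, convert) can be sketched in code. This is an illustrative model only, not Loom's actual API: the class, method, and status names are assumptions based on the workflow in the text.

```python
# Illustrative sketch of the Activescan source lifecycle described above.
# Class, method, and status names are assumptions, not Loom's actual API.

class ScannedSource:
    """A source registered by a crawler such as Activescan."""

    def __init__(self, path, schema, stats=None):
        self.path = path
        self.schema = schema          # e.g. {"column": "type"} inferred by the crawler
        self.stats = stats or {}      # table- and column-level statistics
        self.status = "potential"     # new sources start as "potential"

    def edit_metadata(self, schema=None, stats=None):
        """Users may correct generated metadata before accepting a source."""
        if self.status != "potential":
            raise ValueError("only potential sources can be edited")
        if schema:
            self.schema.update(schema)
        if stats:
            self.stats.update(stats)

    def accept(self):
        """Accepting promotes the source from 'potential' to 'active'."""
        self.status = "active"

    def to_dataset(self):
        """Only active sources can be converted to a Loom Dataset."""
        if self.status != "active":
            raise ValueError("accept the source before converting it")
        return {"path": self.path, "schema": self.schema, "stats": self.stats}


src = ScannedSource("/data/sales/2014.csv", {"region": "string", "total": "double"})
src.edit_metadata(schema={"total": "decimal(10,2)"})   # manual correction
src.accept()
dataset = src.to_dataset()
```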
Loom Datasets, then, have known schemas and structures, along with other useful metadata, including table- and column-level statistics. This makes Loom Datasets far easier to work with than plain HDFS files: data scientists, data engineers, and other Hadoop users can quickly find and understand the data they need through the Loom Workbench. The Loom API exposes the same capabilities to third-party applications as well, and Loom is distributed with a package for R for exactly this purpose.
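To make the idea of column-level statistics concrete, here is a minimal sketch of the kind of per-column summary such metadata might contain. The field names are illustrative assumptions, not Loom's actual metadata model.

```python
# Hypothetical per-column statistics, illustrating the column-level
# metadata the text describes. Field names are assumptions.

def column_stats(values):
    """Compute simple statistics over a column's sample values."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }

# Sample rows for a dataset with (region, total) columns.
rows = [("east", 10.0), ("west", 25.5), ("east", None)]
stats = {
    "region": column_stats([r[0] for r in rows]),
    "total": column_stats([r[1] for r in rows]),
}
```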
Loom Datasets can be transformed through the Loom Workbench or the Loom API using either Loom's SQL engine or the Hive query engine. Hive offers robust SQL support, with many built-in functions and a framework for defining user-defined functions (UDFs). All transformations are recorded in Loom so that dataset lineage is always known.
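The lineage recording described above can be sketched as an append-only log of transformations that is walked backwards to recover a dataset's ancestry. The structures and function names here are assumptions for illustration; in Loom the SQL would actually run on Loom's SQL engine or on Hive, while this sketch only records the lineage entries.

```python
# Sketch of transformation lineage tracking: every transformation is
# recorded so a dataset's ancestry can always be traced. Structures and
# names are illustrative assumptions, not Loom's API.

lineage = []   # append-only log of recorded transformations

def transform(source, target, sql):
    """Record that `target` was derived from `source` via `sql`."""
    lineage.append({"source": source, "target": target, "sql": sql})
    return target

def ancestry(dataset):
    """Walk the lineage log back to the dataset's original sources."""
    chain = []
    current = dataset
    while True:
        entry = next((e for e in lineage if e["target"] == current), None)
        if entry is None:
            break
        chain.append(entry)
        current = entry["source"]
    return chain

transform("sales_raw", "sales_clean",
          "SELECT region, CAST(total AS DECIMAL(10,2)) AS total FROM sales_raw")
transform("sales_clean", "sales_by_region",
          "SELECT region, SUM(total) AS total FROM sales_clean GROUP BY region")
```

Walking the ancestry of `sales_by_region` yields the chain back through `sales_clean` to `sales_raw`, which is the lineage guarantee the text describes.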