Automated Hadoop Data Management
Features of ActiveScan
- Automatically detects new files and directories
- Automatically determines the file type and format
- Automatically defines and registers schemas
- Stores all metadata in centralized registry
For anyone building an application on data that has been loaded into HDFS, the first obstacle is getting the data from the various files in HDFS into the “schema” the application requires.
One of the main reasons Hadoop is being adopted so rapidly is that it can efficiently process unstructured and semi-structured data. One important point is usually left out when this ability is touted: the processing is often accomplished by first transforming the unstructured and semi-structured data into structured data, and then using pre-built query processing engines like Hive and HBase, or even R servers. When files are initially loaded into the Hadoop Distributed File System (HDFS), they have whatever structure was given to them by the system that created them: Apache log files, XML, CSV, text, JSON, etc.
Even if all of the files required by a data scientist are loaded into HDFS from a collection of relational systems, the schema of each file will be different. So in both cases, whether working with collections of unstructured files or collections of relational files, significant transformation work must be done before the files can support either rudimentary or advanced analysis. To date, this work consumes up to 70% of the time of data scientists and application developers working with data in Hadoop.
Revelytix has produced significant new technology to make the task of working with collections of files in Hadoop substantially easier. Our product, Loom, includes a feature called ActiveScan, which scans HDFS for new files (a new file is any file not already registered with Loom) and, when it finds one, automatically transforms it into the structure required for analysis. This is accomplished by two sub-components of ActiveScan: the classifier and the formatter. A classifier introspects a file in HDFS and determines what type of file it is (Apache log file, CSV, etc.). Once the type of the detected file is known, the file can be parsed and loaded into Hive for further processing. Both the classifier and the formatter are pluggable frameworks, so if a classifier for some type of file is not available in Loom, either the customer or Revelytix can create one. The same applies to formatters.
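The classifier and formatter idea can be illustrated with a minimal Python sketch. This is not Loom's actual plug-in interface: the registries, the decorator names, and the `scan` function are all hypothetical, and the sketch works on in-memory text rather than on files in HDFS or tables in Hive. It only shows how a pluggable pair of "detect the type" and "parse for that type" functions might fit together.

```python
import csv
import io
import json
import re
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical plug-in registries (illustrative only, not Loom's API).
CLASSIFIERS: Dict[str, Callable[[str], bool]] = {}
FORMATTERS: Dict[str, Callable[[str], List[dict]]] = {}


def classifier(file_type: str):
    """Register a function that decides whether text is of file_type."""
    def register(fn):
        CLASSIFIERS[file_type] = fn
        return fn
    return register


def formatter(file_type: str):
    """Register a function that parses text of file_type into rows."""
    def register(fn):
        FORMATTERS[file_type] = fn
        return fn
    return register


# Simplified pattern for the Apache common log format.
APACHE_LOG = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')


def _lines(text: str) -> List[str]:
    return [line for line in text.splitlines() if line.strip()]


@classifier("apache_log")
def is_apache_log(text: str) -> bool:
    lines = _lines(text)
    return bool(lines) and APACHE_LOG.match(lines[0]) is not None


@classifier("json_lines")
def is_json_lines(text: str) -> bool:
    lines = _lines(text)
    if not lines:
        return False
    try:
        return all(isinstance(json.loads(line), dict) for line in lines)
    except ValueError:
        return False


@classifier("csv")
def is_csv(text: str) -> bool:
    # Naive heuristic: at least two lines, same non-zero comma count on each.
    lines = _lines(text)
    if len(lines) < 2:
        return False
    counts = {line.count(",") for line in lines}
    return len(counts) == 1 and counts.pop() > 0


@formatter("apache_log")
def format_apache_log(text: str) -> List[dict]:
    fields = ("host", "time", "request", "status", "size")
    matches = (APACHE_LOG.match(line) for line in _lines(text))
    return [dict(zip(fields, m.groups())) for m in matches if m]


@formatter("json_lines")
def format_json_lines(text: str) -> List[dict]:
    return [json.loads(line) for line in _lines(text)]


@formatter("csv")
def format_csv(text: str) -> List[dict]:
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def scan(text: str) -> Tuple[Optional[str], List[dict]]:
    """Try each registered classifier in order, then apply the matching formatter."""
    for file_type, matches in CLASSIFIERS.items():
        if matches(text) and file_type in FORMATTERS:
            return file_type, FORMATTERS[file_type](text)
    return None, []
```

In this sketch, supporting a new file type means registering one more classifier and one more formatter under the same type name, which is the sense in which such a framework is pluggable. Calling `scan("id,name\n1,ann\n2,bob\n")` would classify the text as `csv` and return one dictionary per data row.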
The ActiveScan framework in Loom also profiles any new files it detects, automatically generating critical metadata about each file: number of rows in tables, number of null values, min, max, mean, and standard deviation, file location and format, file partitions, and table and column definitions. From this metadata about each file in HDFS, we are able to determine the type of file so that a format specific to that type can be applied. Once the type and format are applied, the file can be registered in the Loom registry as a dataset, where it becomes a managed asset that can be easily discovered and used by data scientists and application developers.
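As a rough illustration of the column-level profiling described above, the following Python sketch computes a row count, null count, min, max, mean, and standard deviation for one column of already-parsed rows. The function name `profile_column`, the set of null tokens, and the choice of population standard deviation are assumptions made for the example; they do not describe how Loom computes its statistics.

```python
import statistics
from typing import Dict, List, Sequence


def profile_column(
    rows: List[dict],
    column: str,
    null_tokens: Sequence[str] = ("", "NULL", "\\N"),  # assumed null markers
) -> Dict[str, float]:
    """Return simple profile statistics for one column of parsed rows."""
    values: List[float] = []
    nulls = 0
    for row in rows:
        raw = row.get(column)
        if raw is None or str(raw).strip() in null_tokens:
            nulls += 1
            continue
        try:
            values.append(float(raw))
        except (TypeError, ValueError):
            # Non-numeric values count as rows but are left out of the numeric stats.
            pass

    profile: Dict[str, float] = {"rows": len(rows), "nulls": nulls}
    if values:
        profile["min"] = min(values)
        profile["max"] = max(values)
        profile["mean"] = statistics.mean(values)
        # Population standard deviation; a sample estimate would use statistics.stdev.
        profile["stddev"] = statistics.pstdev(values)
    return profile
```

Running this over each column of a newly detected file would yield the kind of per-column metadata that could then be stored alongside the file's location and format in a registry.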
Loom gives users access to both the data in managed datasets and the metadata associated with each dataset, through either the Loom API or Loom’s Hive interface. Loom supports HiveQL and SQL for access to the data in the datasets, and the Loom API provides numerous REST-style methods to retrieve metadata from, or write metadata to, the Loom registry.
Managing and working with data in Hadoop via Loom greatly simplifies life for data scientists and application developers.