Agile Data Preparation

Enterprises are deploying Hadoop for many reasons, including low cost, scalability, and flexibility. These qualities make the Hadoop Distributed File System (HDFS) well suited to big data analytics applications and to a potentially revolutionary use case known as the data lake, where analysts and enterprises have the opportunity to produce new insights and drive business decisions. Data science in the data lake is a highly iterative and exploratory process, and practitioners report that turning raw data into a form suitable for analysis often consumes seventy percent or more of their time.

Agile data preparation reduces the time required to prepare data for analysis and enables analysts to work more effectively with “big data”. Agile data preparation allows analysts to engage with the data up front and then iterate to the right schema or shape to meet the analytic requirement. Although Hadoop provides the fundamental data storage and processing engine, agile data preparation also calls for new kinds of tools for working with data.

Read our whitepaper, "Data Preparation in the Data Lake", to learn more about why agile data preparation is key for success when doing advanced analytics with Hadoop.

Hadoop Metadata Management

Before a Hadoop stakeholder even gets to work with the data, they face a complex environment: the Hadoop Distributed File System (HDFS) holds many files of diverse types, with little or no metadata to describe them. Although the Hadoop ecosystem includes powerful tools for processing and analyzing large amounts of data, its capabilities for managing metadata are limited, and the information management framework behind legacy data platforms is nowhere to be found. The result is that analysts are left without simple ways to find, understand, organize, and shape data, and enterprises are left to watch the data lake become a data swamp.

Hadoop metadata management capabilities empower all stakeholders to maximize the return on investment from a lower-cost, more powerful data platform.

Visit our Get started page to read about the Loom API or download the product and try it out for yourself.

Loom Overview

Loom is the complete solution for getting the most from your data lake. Loom empowers business analysts, data scientists, and data engineers to work interactively with big data to prepare it for advanced analytics. Loom increases productivity for anyone working with Hadoop-based data.

Loom is the only complete Hadoop metadata management solution on the market. Loom takes metadata management into the big data era to enable effective management of the data lake and other big data analytics applications. The extensible metadata registry is populated automatically as Loom scans Hadoop for new files. Close integration with Apache Hive complements Loom Weaver to provide the ultimate tool for agile data preparation. Loom can capture any important metadata about data and processes in Hadoop and across the broader big data architecture.

Loom provides access to Hadoop data and metadata through an open REST API. Built on the API, the Loom Workbench is a simple browser-based UI for working with data and metadata in Hadoop. In addition, the built-in RLoom package provides convenience functions for managing and processing data from the R statistical programming environment.
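As a rough illustration of what working against a REST API like Loom's might look like, the sketch below builds a request URL for a dataset's metadata record and fetches it as JSON. The host, port, endpoint path, and response shape here are illustrative assumptions, not the documented Loom API; consult the Loom API reference for the actual routes.

```python
# Hypothetical sketch of calling a Loom-style REST API from Python.
# The base URL and the /datasets/<name> route are assumptions for
# illustration only -- see the Loom API documentation for real endpoints.
import json
import urllib.parse
import urllib.request

LOOM_BASE = "http://localhost:8080/api"  # assumed host and port


def dataset_url(name):
    """Build the (assumed) URL for a dataset's metadata record."""
    return f"{LOOM_BASE}/datasets/{urllib.parse.quote(name)}"


def get_metadata(name):
    """Fetch a dataset's metadata and decode the JSON response."""
    with urllib.request.urlopen(dataset_url(name)) as resp:
        return json.load(resp)
```

The same pattern applies from R: the built-in RLoom package wraps calls like these in convenience functions so analysts never construct URLs by hand.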

Hadoop Distributions: Cloudera, Hortonworks, MapR, Pivotal, IBM, Intel, Apache
Security: Kerberos, Single Sign-on (JAAS)

Next Steps