Building an enterprise data lake requires a reliable, repeatable and fully operational data management system that covers the ingestion, transformation and distribution of data. It must support varied data types and formats, and it must be able to capture data flows in a variety of ways. The system must do the following:
• Transform, normalize, harmonize, partition, filter and join data
• Interface with anonymization and encryption services external to the cluster
• Generate metadata for all data feeds, snapshots and datasets ingested, and make it accessible through APIs and web services
• Perform policy enforcement for all ingested and processed data feeds
• Track and isolate errors during processing
• Perform incremental processing of data being ingested
• Reprocess data in case of failures and errors
• Apply retention policies on ingested and processed datasets
• Set up a common location format (CLF) for storing staging, compressed, encrypted and processed data
• Provide filtered views over processed datasets
• Monitor, report and alert based on thresholds for transport and data quality issues encountered during ingestion, which helps deliver the highest-quality data for analytics
• Annotate datasets with business/user metadata
• Search datasets using metadata
• Search datasets based on schema field names and types
• Manage data provenance (lineage) as data is processed/transformed in the data lake
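The last four requirements (annotation, metadata search, schema search and lineage) can be illustrated with a small in-memory catalog. This is only a minimal sketch under stated assumptions: the names DatasetRecord, Catalog and the sample datasets are hypothetical and are not part of CDAP or Cask Tracker, whose actual APIs are not described here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry; not a CDAP or Cask Tracker type."""
    name: str
    schema: Dict[str, str]                             # field name -> type name
    tags: Set[str] = field(default_factory=set)        # business/user annotations
    parents: List[str] = field(default_factory=list)   # direct upstream datasets


class Catalog:
    """Toy metadata store supporting annotation, search and lineage."""

    def __init__(self) -> None:
        self._records: Dict[str, DatasetRecord] = {}

    def register(self, record: DatasetRecord) -> None:
        self._records[record.name] = record

    def annotate(self, name: str, *tags: str) -> None:
        # Attach business/user metadata to an existing dataset.
        self._records[name].tags.update(tags)

    def search_by_tag(self, tag: str) -> List[str]:
        return sorted(r.name for r in self._records.values() if tag in r.tags)

    def search_by_field(self, field_name: str,
                        field_type: Optional[str] = None) -> List[str]:
        # Match on field name, and on type as well when one is given.
        return sorted(
            r.name for r in self._records.values()
            if field_name in r.schema
            and (field_type is None or r.schema[field_name] == field_type)
        )

    def lineage(self, name: str) -> List[str]:
        """Return all upstream datasets of `name`, nearest first."""
        seen: List[str] = []
        queue = list(self._records[name].parents)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            if current in self._records:
                queue.extend(self._records[current].parents)
        return seen


# Sample data (invented for illustration only).
catalog = Catalog()
catalog.register(DatasetRecord("raw_events", {"user_id": "string", "ts": "long"}))
catalog.register(DatasetRecord("clean_events", {"user_id": "string", "ts": "long"},
                               parents=["raw_events"]))
catalog.register(DatasetRecord("daily_counts", {"day": "string", "count": "long"},
                               parents=["clean_events"]))
catalog.annotate("daily_counts", "finance", "reporting")

print(catalog.search_by_field("user_id"))   # datasets with a user_id field
print(catalog.search_by_tag("finance"))     # datasets annotated "finance"
print(catalog.lineage("daily_counts"))      # upstream datasets, nearest first
```

A production system would persist this metadata and expose it through APIs and web services, as the requirements state; the in-memory dictionary is used here only to keep the sketch self-contained.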
Benefits of the Cask Solution
- The company’s non-Hadoop developers were able to build an end-to-end data ingestion system without training, saving time and resources.
- Developers were able to build the data lake and get it to customers faster.
- CDAP’s ingestion platform standardized and created conventions for how data is ingested, transformed and stored, allowing faster on-boarding.
- Developers provided a self-service platform for the rest of the organization, enabling departments to use data to make better business decisions.
- CDAP was installed in eight clusters with hundreds of nodes.
- Using Cask Tracker, data lake users were able to quickly locate and access datasets along with their metadata, lineage and provenance. This allowed them to use their clusters efficiently, supported data governance and auditability, and improved data quality.