Managed Data Lake


Data and analytics leaders often find the concept of a data lake hard to pin down. For some, a data lake is simply a staging area that is accessible to several categories of business users. For others, "data lake" is a proxy for "not a data warehouse." A data warehouse is highly structured and requires a schema to be defined before data is stored; a data lake takes the opposite approach. It collects information in near-native form without regard to semantics, quality, or consistency. Data in a data lake is schema-on-read: its structure is not known up front and has to be established through discovery when the data is read. This flexibility unlocks value that was previously unattainable or hard to achieve. Despite these benefits, data lakes pose substantial challenges because:

  • The technology is complex, with many open source alternatives
  • Big data talent is expensive
  • Point solutions are limited and cannot easily be put into production, often requiring custom integration code
  • Petabytes of data distributed across a cluster complicate operations, security, and data governance
  • Lack of structure and pre-processing can result in data swamps
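
To make the schema-on-read idea described above more concrete, here is a minimal Python sketch. It is only an illustration and does not use CDAP: the sample records, the READ_SCHEMA mapping, and the read_with_schema function are all hypothetical names invented for this example. Raw records are stored untouched, and structure is imposed only when they are read, which is also the point at which malformed data first surfaces.

```python
import json

# Hypothetical raw records, stored exactly as they arrived
# (nothing was validated or reshaped at write time).
RAW_EVENTS = [
    '{"user": "alice", "amount": "12.50", "ts": "2016-03-01"}',
    '{"user": "bob", "amount": 7}',
    'not valid json',
]

# Hypothetical schema, applied only at read time:
# field name -> (converter, default used when the field is missing).
READ_SCHEMA = {
    "user": (str, None),
    "amount": (float, 0.0),
}


def read_with_schema(raw_lines, schema):
    """Parse raw lines and project each record onto the given schema.

    Returns (records, rejected): rows that fit the schema, and the raw
    lines that could not be parsed or converted.
    """
    records, rejected = [], []
    for line in raw_lines:
        try:
            doc = json.loads(line)
            if not isinstance(doc, dict):
                raise ValueError("record is not a JSON object")
            row = {}
            for field, (convert, default) in schema.items():
                value = doc.get(field, default)
                row[field] = convert(value) if value is not None else None
        except (ValueError, TypeError):
            # json.JSONDecodeError is a subclass of ValueError.
            rejected.append(line)
            continue
        records.append(row)
    return records, rejected


records, rejected = read_with_schema(RAW_EVENTS, READ_SCHEMA)
print(records)
print(rejected)
```

Fields the schema does not mention (such as "ts") are simply ignored at read time, and a different reader could apply a different schema to the same raw data. The rejected list hints at how a lake without pre-processing can drift toward a data swamp: quality problems stay hidden until someone reads the data.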

A significant part of improving the adoption of data lakes is providing comprehensive, built-in support for both deploying a data lake and maintaining it over time. At Cask, we have created a Unified Integration Platform for Big Data that covers all aspects of data lake management, including data integration, metadata and lineage, security and operations, and app development on Hadoop and Spark. This means that companies can focus on application logic and insights instead of infrastructure and integration.

Unified Integration Platform

CDAP is a unified integration platform that combines application management, data integration, security and governance, and a self-service environment, speeding up the process of building and running a data lake. CDAP provides a broad set of ecosystem integrations for runtime, transport, and storage, including MapReduce, Spark, Spark Streaming, Tigon, Kafka, and HBase.

Mitigates Risk

CDAP provides a comprehensive collection of pre-built building blocks for data manipulation, data storage, and insight extraction, so that teams can build smarter end-to-end solutions with maximum flexibility in the fast-evolving big data ecosystem. This mitigates risk by letting users move quickly from idea to deployment on Hadoop using our CLI or visual interface, reducing cost and delays.

Rapid Time to Value

CDAP enables developers to get started quickly with built-in data ingestion, exploration, and transformation capabilities available through a rich user interface and interactive shell.

Ensures Data Consistency

CDAP makes all data in Hadoop available for real-time and batch access without the need to write code, manage metadata, or copy data. Advanced functionality for scale-out, high-throughput, real-time ingestion and transactional event processing enables new use cases while maintaining data consistency.
