The first unified integration platform for big data that cuts down the time to production for data applications and data lakes by 80%.

“The team at Cask do some fantastic work in product and embrace an open source strategy that’s just what developers need to help them build big data applications on Cloudera.”
– Mike Olson, Co-founder and Chief Strategy Officer, Cloudera
Cask Data Application Platform (CDAP) 4.1 Now Generally Available!

Capabilities

Data Integration

Data Pipelines

CDAP provides a data ingestion service that simplifies and automates the difficult, time-consuming task of building, running, and managing data pipelines. A graphical studio lets you drag and drop various sources, transforms, analytics, sinks, and actions.

  • Drag & drop graphical studio with various sources, transforms, analytics including machine learning algorithms, sinks, and actions
  • Unified interface to preview, debug, deploy, run and manage data pipelines
  • Separation of the logical data pipeline from the execution environment, making it easy to run a pipeline as MapReduce or Spark (or a combination in the future)
  • Extensible pluggable architecture for integrating with new sources, sinks and transforms for processing
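
The separation between a logical pipeline and its execution environment can be illustrated with a minimal Python sketch. This is not the CDAP API: `Pipeline`, `run_sequential`, and `run_batched` are hypothetical names, and the two runner functions merely stand in for different engines such as MapReduce or Spark.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

Record = dict


@dataclass
class Pipeline:
    """Logical pipeline (hypothetical): a source, ordered transforms, a sink."""
    source: Callable[[], Iterable[Record]]
    transforms: List[Callable[[Record], Record]] = field(default_factory=list)
    sink: Callable[[Record], None] = print

    def apply(self, record: Record) -> Record:
        # Run one record through every transform, in order.
        for transform in self.transforms:
            record = transform(record)
        return record


def run_sequential(pipeline: Pipeline) -> None:
    """Stand-in engine 1: process records one at a time."""
    for record in pipeline.source():
        pipeline.sink(pipeline.apply(record))


def run_batched(pipeline: Pipeline, batch_size: int = 2) -> None:
    """Stand-in engine 2: process records in fixed-size batches."""
    batch: List[Record] = []
    for record in pipeline.source():
        batch.append(record)
        if len(batch) == batch_size:
            for item in batch:
                pipeline.sink(pipeline.apply(item))
            batch = []
    for item in batch:  # flush the final partial batch
        pipeline.sink(pipeline.apply(item))
```

The pipeline definition never mentions how it is executed, so the same object can be handed to either runner and should yield the same records at the sink.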

Data Preparation

CDAP provides an easy and interactive way to visualize, transform, and cleanse data. It helps data scientists and data engineers derive new schemas and operationalize the data preparation with a few clicks.

  • Easy and interactive way to work with messy data
  • Apply transformations using merge, delete, and substring operations
  • Quickly visualize patterns both within and across columns
  • Operationalize effortlessly into a production pipeline
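
As a rough illustration of the merge, delete, and substring operations listed above, here is a small Python sketch over rows held as dictionaries. The function names and signatures are assumptions made for this example; they are not CDAP's data preparation directives.

```python
from typing import Dict, List

Row = Dict[str, str]


def merge(rows: List[Row], first: str, second: str, dest: str,
          separator: str = " ") -> List[Row]:
    """Combine two columns into a new column, leaving the inputs in place."""
    return [{**row, dest: row[first] + separator + row[second]} for row in rows]


def delete(rows: List[Row], column: str) -> List[Row]:
    """Drop a column from every row."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]


def substring(rows: List[Row], column: str, dest: str,
              start: int, end: int) -> List[Row]:
    """Copy characters [start, end) of a column into a new column."""
    return [{**row, dest: row[column][start:end]} for row in rows]


# Each step returns new rows, so steps can be chained and replayed later.
rows = [{"first": "Ada", "last": "Lovelace", "phone": "555-0100"}]
rows = merge(rows, "first", "last", "name")
rows = delete(rows, "first")
rows = substring(rows, "phone", "prefix", 0, 3)
```

Because every operation is a pure function of its input rows, the same sequence of steps explored interactively could be replayed unchanged inside a pipeline, which is the idea behind operationalizing data preparation.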

App Development

High-level, Easy-to-Use APIs and Reusable Libraries

CDAP provides an integrated application development framework for Hadoop. It offers standardization and deep integration with diverse Hadoop technologies, with easy-to-use APIs to build, deploy, and manage complex data analytics applications in the cloud or on premises.

  • High-level Java APIs help maximize developer productivity, reducing the time to deliver big data solutions
  • Build, test, and run distributed applications across their entire lifecycle
  • Open, standards-based architecture, and REST APIs to integrate and extend existing infrastructure
  • Automate deployment and monitoring of solutions in Continuous Integration/Continuous Deployment using comprehensive DevOps tools

Metadata & Lineage

Harvest, Index and Track Datasets

  • Automatic capture of technical, business, and operational metadata providing richer application-level data about datasets and programs
  • Reliably index and search metadata with an easy-to-use interface
  • Track lineage by understanding changing datasets and flow of data
  • Gain deep insights into how your datasets are being created, accessed, and processed with built-in usage analytics capabilities
  • Apply multi-dimensional usage analytics to understand complex interactions between users, applications, and datasets
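
One way to picture lineage tracking is as a graph of programs reading and writing datasets. The sketch below is a simplified model under that assumption; `LineageTracker` and its methods are hypothetical and do not reflect how CDAP stores or queries lineage.

```python
from collections import defaultdict
from typing import Dict, Set


class LineageTracker:
    """Hypothetical in-memory record of which programs touch which datasets."""

    def __init__(self) -> None:
        self._reads: Dict[str, Set[str]] = defaultdict(set)    # program -> datasets read
        self._writers: Dict[str, Set[str]] = defaultdict(set)  # dataset -> programs writing it

    def record_read(self, program: str, dataset: str) -> None:
        self._reads[program].add(dataset)

    def record_write(self, program: str, dataset: str) -> None:
        self._writers[dataset].add(program)

    def upstream(self, dataset: str) -> Set[str]:
        """Return every dataset that directly or indirectly fed into `dataset`."""
        seen: Set[str] = set()
        stack = [dataset]
        while stack:
            current = stack.pop()
            # Follow each writer of the current dataset back to what it read.
            for program in self._writers.get(current, ()):
                for source in self._reads.get(program, ()):
                    if source not in seen:
                        seen.add(source)
                        stack.append(source)
        return seen
```

For example, if a `cleanse` program reads `raw` and writes `clean`, and an `aggregate` program reads `clean` and writes `report`, then `upstream("report")` returns both `clean` and `raw`.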

Supports Standardization, Governance and Compliance Needs

  • Audit log for easy traceability for data quality and compliance needs
  • Data dictionary to define and enforce common descriptions of data across datasets, including a common naming convention and type, and to indicate whether a column contains PII data
  • Empower business users to tag and classify data to provide business context
  • Integrates with metadata management systems (e.g. Cloudera Navigator) to centralize the metadata repository and deliver accurate, complete, and correct data
  • Maintain consistent definitions of metadata to reconcile differences in terminology
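
The data dictionary idea above can be sketched as a lookup that checks a dataset's schema against shared column definitions. Everything here (`DictionaryEntry`, `validate_schema`, `pii_columns`, and the string type names) is a hypothetical illustration, not CDAP's data dictionary interface.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DictionaryEntry:
    """One shared column definition (hypothetical structure)."""
    name: str
    type: str
    contains_pii: bool
    description: str = ""


def validate_schema(schema: Dict[str, str],
                    dictionary: Dict[str, DictionaryEntry]) -> List[str]:
    """Return a list of violations; an empty list means the schema complies."""
    problems = []
    for column, column_type in schema.items():
        entry = dictionary.get(column)
        if entry is None:
            problems.append(f"{column}: not defined in the data dictionary")
        elif entry.type != column_type:
            problems.append(
                f"{column}: expected type {entry.type}, found {column_type}")
    return problems


def pii_columns(schema: Dict[str, str],
                dictionary: Dict[str, DictionaryEntry]) -> List[str]:
    """List the schema's columns that the dictionary flags as PII."""
    return sorted(c for c in schema
                  if c in dictionary and dictionary[c].contains_pii)
```

Keeping the definitions in one place means every dataset is checked against the same names, types, and PII flags, rather than each team maintaining its own conventions.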

Security & Operations

Robust Security and a Portable Production Runtime Environment

  • Sophisticated security, authentication, authorization, and encryption for compliance needs and to mitigate risk
  • Deep enterprise integrations for security and authentication, such as LDAP, Active Directory, Kerberos, JASPI, Apache Sentry, and Apache Ranger
  • Isolation of data and operations from users and push down of access control to lower layers
  • Robust production runtime environment for easy, secure deployment and management on Hadoop
  • High availability, disaster recovery and replication to support production business-critical usage

Architecture

Abstraction, Standardization and Future-Proofing

CDAP provides a container architecture for your data and applications on Hadoop. High-level abstractions and deep integrations with diverse Hadoop technologies increase productivity and quality, accelerating development and reducing the time to production for your Hadoop projects.

  • 100% open source, 100% Hadoop native
  • Flexible, multi-tenant deployment capabilities to accommodate shared data and application infrastructure
  • Packaging of data and applications simplifies the full production lifecycle
  • Encapsulation of data and programs stored and running in systems like HDFS, HBase, Spark, and MapReduce enables portability of big data solutions on-premises, in the cloud and for hybrid environments
  • Standardization of data in varied storage engines and compute on varied processing engines promotes reusability and simplified security, operations, and governance across projects and environments
  • Maximum flexibility and reduced risk with insulation from changes in the fast-evolving big data ecosystem

Data Containers

CDAP Datasets provide a standardized, logical container and runtime framework for data in varied storage engines. They integrate with other systems for instant data access and allow the creation of complex, reusable data patterns.

Program Containers

CDAP Programs provide a standardized, logical container and runtime framework for compute in varied processing engines. They simplify testing and operations with a standard lifecycle, and can consistently interact with any data container.

Application Containers

CDAP Applications provide a standardized packaging system and runtime framework for Datasets and Programs. They manage the lifecycle of data and apps and simplify the painful integration and operation processes in heterogeneous infrastructure.
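
The relationship between the three container types can be modeled in a few lines of Python. This is a conceptual sketch only: the class names mirror the terms used above, but the fields, methods, and status values are assumptions and do not correspond to CDAP's actual Java classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Dataset:
    """Logical data container over some storage engine (hypothetical model)."""
    name: str
    storage_engine: str  # e.g. "HBase" or "HDFS"


@dataclass
class Program:
    """Logical compute container with a uniform start/stop lifecycle."""
    name: str
    processing_engine: str  # e.g. "Spark" or "MapReduce"
    datasets: List[str] = field(default_factory=list)
    status: str = "STOPPED"

    def start(self) -> None:
        self.status = "RUNNING"

    def stop(self) -> None:
        self.status = "STOPPED"


@dataclass
class Application:
    """Package that bundles datasets and programs and manages them together."""
    name: str
    datasets: Dict[str, Dataset] = field(default_factory=dict)
    programs: Dict[str, Program] = field(default_factory=dict)

    def add_dataset(self, dataset: Dataset) -> None:
        self.datasets[dataset.name] = dataset

    def add_program(self, program: Program) -> None:
        # A program may only reference datasets packaged in this application.
        missing = [d for d in program.datasets if d not in self.datasets]
        if missing:
            raise ValueError(f"{program.name} references unknown datasets: {missing}")
        self.programs[program.name] = program

    def start_all(self) -> None:
        for program in self.programs.values():
            program.start()
```

In this model, a program refers to datasets only by logical name, so the storage engine behind a dataset can change without touching the program that uses it.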

Benefits

Self-Service with Guardrails

CDAP enables IT to create a self-service experience from data ingestion to data delivery with minimal intervention, while putting in the necessary “guardrails” for enterprise oversight and control.

Build Once, Run Anywhere

CDAP encapsulates data access patterns and business logic to enable portability and reusability across on-premises, cloud, and hybrid environments on all major Hadoop distributions and cloud providers.

Enterprise-Ready

CDAP is an open and standards-based architecture that provides extensive security, compliance and resiliency features to support the scale and risk profile of a modern enterprise big data platform.
