CDAP for Spark
With so much data being processed by enterprises every day, it’s essential to stream and analyze it in real time. Apache Spark provides a framework for advanced analytics right out of the box, including an engine for accelerated queries (Spark SQL), a machine learning library (MLlib), and a streaming analytics engine (Spark Streaming). These pre-built libraries are easier and faster to use than implementing the same analytics in MapReduce, which requires specialized skills.
However, many enterprises fail to take advantage of Spark's sophisticated processing capabilities because of the absence of tools and reusable components that would improve efficiency when building multiple Spark applications. Spark also lacks key features required by enterprises, including built-in data governance. Without the ability to audit metadata, it is difficult to track the data lifecycle, forcing developers to build custom data-tracking solutions.
Furthermore, Spark does not support globally consistent transactions, which rules out use cases where time-ordered writes to the database are critical. Finally, Spark does not aggregate logs in a central location, so developers and DevOps engineers must dig through logs on each node to debug their applications.
Cask Tracker provides an out-of-the-box governance solution, giving Spark users access to metadata as well as audit trails and data lineage analysis.
Apache Tephra, a transaction engine packaged into CDAP that supports multi-versioning and rollback, provides globally consistent transactions on top of Spark.
Rapid Time to Value
Cask Data Application Platform (CDAP) simplifies building and debugging Spark applications by offering reusable building blocks and by aggregating all Spark logs, making them viewable in real time.
CDAP runs each Spark process in its own container, allowing developers to run Spark and other Hadoop workloads in parallel on the same cluster.