Security Analytics and Reporting
A Fortune 50 financial institution ran a legacy pipeline that loaded batched data onto a secured Hadoop cluster to produce daily aggregates and reports. Although the pipeline performed multiple transformations that created new datasets, the organization faced several issues:
• The data pipeline was inefficient, taking up to six hours to run and requiring manual intervention almost daily
• Reports were misaligned with day boundaries
• Any failure required reconfiguring and restarting the entire pipeline, which was time-consuming and frustrating
• Adding new data sources required significant setup and development time
• The team was unable to test and validate the pipeline prior to deployment, so testing was conducted directly on the production cluster, a poor use of resources
Using CDAP, the organization’s data development team created independent, parallel pipelines that moved data from SQL into Time Partitioned Datasets. Transformations were performed in-flight, with error records handled rather than halting the run. After the initial transfers completed, another pipeline combined the data into a single Time Partitioned Dataset and fed it into an aggregation and reporting pipeline.
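The core pattern described above can be sketched in plain Java: records are transformed in-flight, written into day-keyed partitions (mirroring a Time Partitioned Dataset), and malformed records are routed to an error set instead of failing the whole run. This is an illustrative sketch under stated assumptions, not the actual CDAP API; the class and field names (`PartitionedTransform`, `partitions`, `errorRecords`) are hypothetical.

```java
import java.time.LocalDate;
import java.util.*;

// Illustrative sketch of in-flight transformation with error handling and
// day-based partitioning. Not the CDAP API; names are hypothetical.
public class PartitionedTransform {
    // Partitioned "dataset": partition key is the record's day.
    static Map<LocalDate, List<Map<String, String>>> partitions = new HashMap<>();
    // Error records are collected instead of aborting the pipeline.
    static List<Map<String, String>> errorRecords = new ArrayList<>();

    // In-flight transform: validate, normalize, and route each record.
    static void transform(Map<String, String> record) {
        String ts = record.get("ts");
        if (ts == null) {
            errorRecords.add(record);              // missing timestamp -> error set
            return;
        }
        try {
            LocalDate day = LocalDate.parse(ts);   // partition key aligned to day boundary
            record.put("amount", record.getOrDefault("amount", "0").trim());
            partitions.computeIfAbsent(day, d -> new ArrayList<>()).add(record);
        } catch (Exception e) {
            errorRecords.add(record);              // bad record does not halt the run
        }
    }

    public static void main(String[] args) {
        transform(new HashMap<>(Map.of("ts", "2016-03-01", "amount", " 42 ")));
        transform(new HashMap<>(Map.of("amount", "7"))); // no timestamp
        System.out.println(partitions.keySet());   // prints [2016-03-01]
        System.out.println(errorRecords.size());   // prints 1
    }
}
```

Because each partition is keyed by day, a downstream aggregation step can read exactly one day's partition, which addresses the day-boundary misalignment of the legacy reports.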
Benefits of the Cask Solution
- In-house Java developers with limited knowledge of Hadoop built and ran the complex pipelines at scale within two weeks after only four hours of training.
- The new data pipeline ran in approximately two hours, down from six, without any manual intervention.
- In-flight transformations handled error records without failing the run.
- The visual interface enabled the team to develop, test, debug, deploy, run, automate, and monitor pipelines in operation.
- The development experience was improved by reducing unnecessary cluster utilization.
- Tracking tools made it easy to rerun the process from any point of failure.
- The new process reduced system complexity, which simplified pipeline management.