Security Analytics and Reporting
A Fortune 50 financial institution ran a legacy pipeline that loaded batched data onto a secured Hadoop cluster to produce daily aggregates and reports. Although the pipeline performed multiple transformations that created new datasets, the organization faced several issues:
• The data pipeline was inefficient, taking up to six hours to run and requiring manual intervention almost daily
• Reports were misaligned with day boundaries
• A failure at any point required reconfiguring and restarting the pipeline, which was time-consuming and frustrating
• Significant setup and development time was needed to add new data sources
• The team could not test and validate the pipeline before deployment, so testing was conducted directly on the production cluster, a poor use of resources
Using Cask Hydrator, the organization’s data development team created independent, parallel pipelines that moved the data from SQL databases into Time-Partitioned Datasets. Transformations were performed in-flight, with the ability to route error records aside rather than fail the run. After the initial transfers completed, another pipeline combined the data into a single Time-Partitioned Dataset and fed it into an aggregation and reporting pipeline.
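The pipeline architecture described above can be sketched as a Hydrator batch pipeline configuration. The JSON below is a minimal illustrative sketch, not the institution’s actual configuration: the stage names, connection string, query, and dataset name are all hypothetical, and each plugin’s properties are abbreviated to the essentials.

```json
{
  "name": "sqlToTimePartitionedDataset",
  "artifact": { "name": "cdap-data-pipeline", "scope": "SYSTEM" },
  "config": {
    "stages": [
      {
        "name": "sqlSource",
        "plugin": {
          "name": "Database",
          "type": "batchsource",
          "properties": {
            "connectionString": "jdbc:mysql://dbhost:3306/finance",
            "importQuery": "SELECT * FROM transactions"
          }
        }
      },
      {
        "name": "cleanRecords",
        "plugin": {
          "name": "JavaScript",
          "type": "transform",
          "properties": {
            "script": "function transform(input, emitter, context) { emitter.emit(input); }"
          }
        }
      },
      {
        "name": "tpfsSink",
        "plugin": {
          "name": "TPFSAvro",
          "type": "batchsink",
          "properties": { "name": "transactionsPartitioned" }
        }
      }
    ],
    "connections": [
      { "from": "sqlSource", "to": "cleanRecords" },
      { "from": "cleanRecords", "to": "tpfsSink" }
    ]
  }
}
```

Each parallel source pipeline would follow this same shape with its own source stage, and the downstream aggregation pipeline would read the combined Time-Partitioned Dataset as its source.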
Benefits of the Cask Solution
- In-house Java developers with limited knowledge of Hadoop built and ran the complex pipelines at scale within two weeks after only four hours of training.
- The new data pipeline ran in approximately two hours, compared to six hours before, without any manual intervention.
- Transforms were performed in-flight with the ability to handle error records.
- The visual interface enabled the team to develop, test, debug, deploy, run, automate, and monitor pipelines during operations.
- The development experience was improved by reducing unnecessary cluster utilization.
- Tracking tools made it easy to rerun the process from any point of failure.
- The new process reduced system complexity, which simplified pipeline management.