Many large enterprise environments have data-intensive information sources and a constant need to load that data into Hadoop clusters for complex joins, filtering, transformations, and report generation. Moving data to Hadoop is a cost-effective alternative to running complex, ad-hoc queries that would otherwise require expensive execution on traditional data storage and querying technologies.
The customer had originally attempted to build a reliable, repeatable data pipeline for generating reports across all network devices accessing resources. Data was aggregated into five different Microsoft SQL Server instances. The aggregated data was staged daily into a secured (Kerberos) Hadoop cluster. Once the data was loaded into the staging area, transformations (renaming fields, changing field types, projecting fields) were performed to create new datasets. The data was registered in Hive so that Hive SQL queries could be run for any ad-hoc investigation. Once all the data was in final, independent datasets, the next job began, joining the data from all five tables to create a new table that provided a 360-degree view of all network devices. This table was then used to generate a report that fed into another job. The customer faced the following challenges:
• Ensuring the reports aligned to day-to-day boundaries
• Restarting failed jobs from the point of failure, which required reconfiguring the pipelines
• Adding new sources required a great deal of setup and development time
• Inability to test the pipeline before it was deployed led to inefficient utilization of the cluster as all the testing was performed on the cluster
• A loosely federated set of technologies (Sqoop, Oozie, MapReduce, Spark, Hive, and Bash scripts), cobbled together, was inefficient
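The staging steps described above (rename fields, change field types, project fields, then join the cleaned datasets on a shared key) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the customer's pipeline or any Cask API: the field names (`dev_id`, `device_id`, `hostname`, `last_seen`), the sample records, and the helpers `transform` and `join_on` are all hypothetical.

```python
from datetime import date


def transform(record, renames, types, keep):
    """Rename fields, cast field types, then project to the kept fields."""
    renamed = {renames.get(k, k): v for k, v in record.items()}
    typed = {k: types[k](v) if k in types else v for k, v in renamed.items()}
    return {k: typed[k] for k in keep}


def join_on(key, *datasets):
    """Merge records from several datasets that share the same key value."""
    merged = {}
    for dataset in datasets:
        for rec in dataset:
            merged.setdefault(rec[key], {}).update(rec)
    return list(merged.values())


# Hypothetical records standing in for two of the five source systems.
inventory = [{"dev_id": "42", "hostname": "sw-01", "unused": "x"}]
access = [{"device_id": 42, "last_seen": "2016-03-01"}]

inventory_clean = [
    transform(r, {"dev_id": "device_id"}, {"device_id": int},
              ["device_id", "hostname"])
    for r in inventory
]
access_clean = [
    transform(r, {}, {"last_seen": date.fromisoformat},
              ["device_id", "last_seen"])
    for r in access
]

# One combined record per device, keyed on the shared device_id.
device_view = join_on("device_id", inventory_clean, access_clean)
```

Note that `join_on` behaves like an outer join: a device present in only one dataset still appears in the result, with only the fields that dataset supplied.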
Benefits of the Cask Solution
- A team of 10 Java (non-Hadoop) developers built an end-to-end data ingestion system without training beyond an initial four-hour session, saving the organization time and resources.
- After just four hours of training, in-house Java developers with limited Hadoop knowledge built and ran complex pipelines at scale within two weeks.
- Transforms were performed in-flight with error record handling
- The visual interface enabled the team to build, test, debug, deploy, run and view pipelines during operations.
- The process reduced system complexity dramatically, which simplified pipeline management.
- Testing pipelines before deployment reduced inefficient cluster utilization and improved the development experience.
- Tracking tools made it easy to rerun the process from any point of failure.
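
Two of the ideas in this list, in-flight transforms with error record handling and rerunning from the point of failure, can be illustrated with a small Python sketch. It is a simplified illustration only and says nothing about how the Cask product implements these features: the helper names `run_with_error_handling` and `run_pipeline`, the stage names, and the sample records are hypothetical.

```python
def run_with_error_handling(records, transform):
    """Apply transform to each record; route failing records to an error list."""
    good, errors = [], []
    for rec in records:
        try:
            good.append(transform(rec))
        except (KeyError, ValueError, TypeError) as exc:
            errors.append({"record": rec, "error": str(exc)})
    return good, errors


def run_pipeline(stages, completed):
    """Run named stages in order, skipping any stage already completed."""
    for name, stage in stages:
        if name in completed:
            continue
        stage()  # if this raises, the stage is not marked as completed
        completed.add(name)
    return completed


# Hypothetical records: the second one cannot be converted and becomes an error record.
good, errors = run_with_error_handling(
    [{"port": "80"}, {"port": "abc"}],
    lambda r: {"port": int(r["port"])},
)
```

Because `completed` records which stages have finished, calling `run_pipeline` again with the same set after a failure resumes at the failed stage instead of starting over.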