Data Cleansing and Validation
A Fortune 500 financial organization had a legacy data pipeline that performed poorly, required multiple teams to keep it operating, and was costly to maintain. The company therefore decided to revamp the pipeline so that it would validate and correct data. In-house Java programmers developed the modernized pipeline for data cleansing and validation, using CDAP alongside numerous other complex technologies. These developers tested and ran the replacement pipeline using the drag-and-drop visual interface in CDAP. The new data pipeline required only limited coding, to integrate custom regular expressions, and it processed over three billion records using the following procedures:
• Standardization, verification, and cleansing of USPS codes
• Domain set validation, null checks, and length checks
• Regular expression validation (email, SSN, dates, etc.)
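The checks listed above can be sketched in a few lines of Python. This is only an illustration and not the company's CDAP pipeline: the field names, the abbreviated state-code set, the length limit, and the regular expressions are all assumptions, since the case study does not publish its actual rules.

```python
import re
from typing import Dict, List, Optional

# All rules below are hypothetical examples, not the rules used in the case study.
STATE_CODES = {"CA", "NY", "TX"}          # abbreviated domain set of USPS state codes
MAX_NAME_LENGTH = 50                      # assumed length limit
REQUIRED_FIELDS = ("name", "state", "email", "ssn", "signup_date")
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "signup_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def standardize_state(value: Optional[str]) -> Optional[str]:
    """Standardize a state code by trimming whitespace and upper-casing it."""
    if value is None:
        return None
    return value.strip().upper()


def validate_record(record: Dict[str, Optional[str]]) -> List[str]:
    """Return a list of error messages; an empty list means the record passed."""
    errors: List[str] = []

    # Null check: every required field must be present and non-blank.
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or value.strip() == "":
            errors.append(f"{field}: missing")

    # Length check on the name field.
    name = record.get("name")
    if name and len(name) > MAX_NAME_LENGTH:
        errors.append("name: too long")

    # Domain set validation on the standardized state code.
    state = standardize_state(record.get("state"))
    if state and state not in STATE_CODES:
        errors.append("state: not in domain set")

    # Regular expression validation; blank values were already reported above.
    for field, pattern in PATTERNS.items():
        value = record.get(field)
        if value is None or not value.strip():
            continue
        if not pattern.fullmatch(value.strip()):
            errors.append(f"{field}: bad format")

    return errors
```

Returning a list of error messages rather than raising on the first failure lets a pipeline report every problem in a record at once, which suits the daily error reporting described in this case study.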
Using CDAP, the company was able to extract data from Netezza and other SQL sources, perform complex joins and transformations, and load the results into HDFS. They were then able to perform further aggregations and joins to generate the final report. Loading the final report data back into Netezza was seamless. The company’s in-house team built the data pipeline in less than a week using the drag-and-drop visual interface in CDAP, scheduled it to run daily, and configured it to report on errors, giving them the visibility into the data that they needed. Beyond those capabilities, they were able to build a pipeline-level dashboard that provided deep insight into how the offloading and report generation process was functioning.
Benefits of Cask Solution
- The in-house team built, tested, and deployed the custom data pipeline in just three days.
- The development team required only four hours of training on CDAP before launching the project.
- Java developers tested and ran the replacement pipeline using an easy-to-use drag-and-drop visual interface.
- Processing the three billion records took less than 65% of the time the legacy pipeline required.
- The new pipeline eliminated the need for costly Hadoop experts, improved performance, and decreased the number of technologies involved, thereby reducing complexity.