Evolution of smart data pipelines

Artificial intelligence (AI) and machine learning (ML) appear to have nearly limitless potential to derive and drive new sources of customer, product, service, operational, environmental, and social value. If your organization is going to compete in the economy of the future, then AI must be at the center of your business operations.

A study by Kearney titled “The impact of analytics in 2020” highlights the untapped profitability and business impact for organizations looking for a justification to accelerate their investments in data science (AI / ML) and data management:

  • Explorers could improve profitability by 20% if they were as effective as leaders
  • Followers could improve profitability by 55% if they were as effective as leaders
  • Laggards could improve profitability by 81% if they were as effective as leaders

The business, operational, and societal impacts could be staggering except for one significant organizational challenge: data. No less than the godfather of AI, Andrew Ng, has noted the impediment of data and data management in empowering organizations and society to realize the potential of AI and ML:

“The model and code for many applications are basically a solved problem. Now that the models have advanced to some degree, we have to make the data work too.” - Andrew Ng

Data is the heart of AI and ML model training. And high-quality, reliable data orchestrated through highly efficient and scalable pipelines means that AI can enable these compelling business and operational results. Just as a healthy heart needs oxygen and reliable blood flow, so too is a constant stream of clean, accurate, rich, and reliable data important to AI / ML engines.

For example, a CIO has a team of 500 data engineers managing more than 15,000 extract, transform, and load (ETL) jobs that are responsible for acquiring, moving, adding, standardizing, and aligning data across hundreds of special-purpose data repositories (data marts, data warehouses, and data lakes). They are performing these tasks in the organization’s customer support and operating systems under ridiculously strict service level agreements (SLAs) to support their growing number of diverse data consumers. It seems that Rube Goldberg certainly must have become a data architect (Figure 1).

Figure 1: Rube Goldberg data architecture

The debilitating spaghetti architecture of static, one-time, special-purpose ETL programs for moving, cleaning, aligning, and transforming data greatly inhibits the “turnaround time” required for organizations to take full advantage of the unique economic features of data, “the most valuable resource in the world” according to The Economist.

Emergence of smart data pipelines

The purpose of a data pipeline is to automate and scale common and repetitive data acquisition, transformation, movement, and integration tasks. A properly constructed data pipeline strategy can speed up and automate the processing associated with collecting, cleaning, transforming, enriching, and transferring data to downstream systems and applications. As the volume, variety, and speed of data continue to grow, the need for data pipelines that can scale linearly within cloud and hybrid cloud environments becomes increasingly critical to a company’s operations.

A data pipeline refers to a set of data processing activities that integrate business and operational logic to perform advanced data provisioning, transformation, and loading. A data pipeline can run on a schedule, in real time (streaming), or triggered by a predetermined rule or set of conditions.
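As a rough illustration of the idea, a pipeline can be sketched as an ordered list of small processing steps applied to each record. The sketch below is a minimal example under stated assumptions, not a description of any particular product: the stage names, the record fields (`customer`, `country`, `quantity`, `unit_price`), and the `run_pipeline` helper are all hypothetical.

```python
from typing import Callable, Iterable

Record = dict
Stage = Callable[[Record], Record]


def clean(record: Record) -> Record:
    """Trim surrounding whitespace from every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def transform(record: Record) -> Record:
    """Standardize the (hypothetical) 'country' field to upper case."""
    return {**record, "country": record["country"].upper()}


def enrich(record: Record) -> Record:
    """Add a derived field: the order total from quantity and unit price."""
    return {**record, "total": record["quantity"] * record["unit_price"]}


def run_pipeline(records: Iterable[Record], stages: list[Stage]) -> list[Record]:
    """Pass each record through every stage in order and collect the results."""
    output = []
    for record in records:
        for stage in stages:
            record = stage(record)
        output.append(record)
    return output


# Hypothetical source data standing in for an upstream system.
source = [{"customer": " Ada ", "country": "gb ", "quantity": 3, "unit_price": 5}]
loaded = run_pipeline(source, [clean, transform, enrich])
print(loaded)
```

The same `run_pipeline` call could be invoked by a scheduler, by a consumer reading from a stream, or by a rule that fires when a condition is met; only the caller changes, not the stages.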

Additionally, logic and algorithms can be integrated into a data pipeline to create a “smart” data pipeline. Smart pipelines are inexpensive, reusable, extensible assets that can be specialized for source systems and perform the necessary data transformations to support unique data and analytical requirements for the target system or application.

As machine learning and AutoML become more prevalent, data pipelines will get smarter and smarter. Data pipelines can move data between advanced data enrichment and transformation modules, where neural network and machine learning algorithms can create more advanced data transformations and enrichments. This includes segmentation, regression analysis, clustering, and the creation of advanced indexes and propensity scores.
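One way to picture such an enrichment module is a stage that attaches a propensity score and a coarse segment to each record as it passes through. The sketch below is only an illustration: the feature names, the weights, the bias, and the 0.5 cut-off are hypothetical values standing in for what a trained model would supply.

```python
import math

# Hypothetical coefficients standing in for a trained model's parameters.
WEIGHTS = {"visits_last_30d": 0.3, "open_tickets": -0.5}
BIAS = -1.0


def propensity(record: dict) -> float:
    """Logistic score in (0, 1) computed from a weighted sum of the features."""
    z = BIAS + sum(weight * record[name] for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def enrich_with_propensity(record: dict) -> dict:
    """Return a copy of the record with a score and a coarse segment label."""
    score = propensity(record)
    segment = "high" if score >= 0.5 else "low"
    return {**record, "propensity": score, "segment": segment}


# Hypothetical customer records flowing through the pipeline.
customers = [
    {"id": 1, "visits_last_30d": 10, "open_tickets": 0},
    {"id": 2, "visits_last_30d": 1, "open_tickets": 2},
]
scored = [enrich_with_propensity(c) for c in customers]
for row in scored:
    print(row["id"], row["segment"], round(row["propensity"], 3))
```

Because the stage returns a new record rather than modifying its input, it can be placed anywhere in a chain of transformations without affecting the steps before it.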

Finally, AI could be integrated into data pipelines so that they can continuously learn and adapt based on source systems, required data enrichments and transformations, and the evolving business and operational requirements of the target systems and applications.

For example: a smart healthcare data pipeline could analyze the grouping of healthcare diagnosis-related group (DRG) codes to ensure the consistency and integrity of DRG submissions and detect fraud as the data pipeline transfers the DRG data from the source system to the analytical systems.
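A simplified version of the DRG example might look like the sketch below, which checks each submission in flight and holds back anything suspicious. It rests on several assumptions: that DRG codes are three-digit strings, that a field named `diagnosis_group` exists on each record, and that a lookup of plausible codes per group is available. The lookup values here are illustrative only and are not an authoritative mapping. A consistency check like this is also far simpler than real fraud detection.

```python
import re

# Assumption: DRG codes are three-digit strings such as "470".
DRG_PATTERN = re.compile(r"\d{3}")

# Hypothetical, illustrative lookup of plausible DRG codes per diagnosis group.
ALLOWED_DRGS = {
    "joint_replacement": {"469", "470"},
    "heart_failure": {"291", "292", "293"},
}


def check_drg(record: dict) -> list[str]:
    """Return a list of issues found in one submission (empty if none)."""
    issues = []
    code = record.get("drg", "")
    if not DRG_PATTERN.fullmatch(code):
        issues.append("malformed DRG code")
    elif code not in ALLOWED_DRGS.get(record.get("diagnosis_group"), set()):
        issues.append("DRG inconsistent with diagnosis group")
    return issues


def transfer(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into those passed downstream and those held for review."""
    passed, flagged = [], []
    for record in records:
        issues = check_drg(record)
        if issues:
            flagged.append({**record, "issues": issues})
        else:
            passed.append(record)
    return passed, flagged


# Hypothetical submissions arriving from a source system.
submissions = [
    {"claim": "A1", "diagnosis_group": "joint_replacement", "drg": "470"},
    {"claim": "A2", "diagnosis_group": "joint_replacement", "drg": "47"},
    {"claim": "A3", "diagnosis_group": "joint_replacement", "drg": "291"},
]
passed, flagged = transfer(submissions)
print(len(passed), "passed;", [(f["claim"], f["issues"]) for f in flagged])
```

Flagged records carry their list of issues with them, so a reviewer or a downstream model can see why each one was held.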

Discovering the business value

CIOs are faced with the challenge of unlocking the business value of their data: applying the data to the business to drive measurable financial impact.

The ability to bring high-quality, reliable data to the right data consumer at the right time to facilitate more timely and accurate decisions will be a key differentiator for today’s data-rich companies. A Rube Goldberg system of ELT scripts and disparate, special-purpose analytics repositories hampers an organization’s ability to achieve that goal.

Learn more about smart data pipelines in Modern business data pipelines (eBook) by Dell Technologies.

This content was produced by Dell Technologies. It was not written by the editorial staff of MIT Technology Review.

