Building a (big) data pipeline the right way

Data collection and analysis have been all the rage in business for quite some time. Yet all too often, the former takes hold of companies with such force that the idea of actually using the data is neglected. There’s a reason we had to come up with a name for this phenomenon: “dark data.”

Unfortunately, data is often collected for no good reason. It’s understandable – a lot of internal data is collected by default. Today’s business climate requires the use of many tools (e.g., CRM, accounting, and invoicing software) that create reports and store data automatically.

The collection process is even broader for digital businesses and often includes server logs, consumer behavior, and other tangential information.


Unless you’re in the data-as-a-service (DaaS) business, simply collecting data doesn’t offer any benefits. With all the hype surrounding data-driven decision making, I think a lot of people have lost sight of the forest for the trees. The collection of all forms of data becomes an end in itself.

In fact, this approach costs companies money. There is no free lunch – someone has to set up the collection method, manage the process, and control the results. Those are wasted resources and finances. Instead of obsessing over the amount of data, we should look for ways to simplify the collection process.

Humble beginnings

Virtually every business begins its data acquisition journey by collecting marketing, sales, and account data. Certain practices like pay-per-click (PPC) have proven to be incredibly easy to measure and analyze through the lens of statistics, making data collection a must. On the other hand, relevant data is often produced as a by-product of the usual daily activities in sales and account management.

Businesses have already realized that sharing data between marketing, sales, and account management departments can lead to great things. However, the data pipeline is often clogged, and relevant information is shared only in abstract summaries.

The way departments share information often lacks immediacy. There is no direct access to the data; instead, it is passed along in meetings and discussions. That is not the best way to do it. Constant, direct access to fresh data, by contrast, can provide departments with important information the moment it appears.

Interdepartmental data

Unsurprisingly, cross-departmental data can improve efficiency in many ways. For example, sharing data on the ideal customer profile (ICP) across departments will lead to better sales and marketing practices (for example, a more sharply defined content strategy).

This is the burning problem for all companies that collect large amounts of data: it is scattered. Potentially useful information is left sitting in spreadsheets, CRMs, and other management systems. Therefore, the first step should be not to obtain more data but to optimize current processes and prepare the existing data for use.

Combining data sources

Fortunately, with the advent of Big Data, companies have been thinking about information management processes in great detail. As a result, data management practices have come a long way in recent years, greatly simplifying optimization processes.

Data warehouses

A commonly used data management principle is the creation of a warehouse of data collected from numerous sources. But of course the process is not as simple as integrating a few different databases. Unfortunately, data is often stored in incompatible formats, necessitating standardization.

Typically, data integration in a warehouse follows a three-step process: extract, transform, load (ETL). There are other approaches; however, ETL is probably the most popular option. Extraction, in this case, means taking data that has already been acquired through internal or external collection processes.

Data transformation is the most complex of the three. It involves consolidating data from several formats into a common one and identifying missing or repeated fields. In most companies, doing all of this manually is out of the question; therefore, programmatic methods (for example, SQL scripts) are used.
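As a rough sketch, the transformation step can be expressed in plain Python just as well as in SQL. The source systems, field names, and date formats below are hypothetical – the point is mapping heterogeneous records onto a common schema and catching repeated fields along the way:

```python
from datetime import datetime

# Hypothetical records extracted from two systems that store the same
# customers with different date formats and inconsistent casing.
crm_rows = [
    {"email": "ann@example.com", "signup": "2023-01-15", "plan": "pro"},
    {"email": "bob@example.com", "signup": "2023-02-01", "plan": "basic"},
]
invoicing_rows = [
    {"email": "ANN@example.com", "signup": "15/01/2023", "plan": "pro"},
]

def normalize(row, date_format):
    """Map one source row onto the common warehouse schema."""
    return {
        "email": row["email"].lower(),
        "signup": datetime.strptime(row["signup"], date_format).date().isoformat(),
        "plan": row["plan"],
    }

def transform(sources):
    """Standardize formats and drop repeated records (keyed on email)."""
    seen, unified = set(), []
    for rows, date_format in sources:
        for row in rows:
            clean = normalize(row, date_format)
            if clean["email"] not in seen:  # repeated field -> skip duplicate
                seen.add(clean["email"])
                unified.append(clean)
    return unified

records = transform([(crm_rows, "%Y-%m-%d"), (invoicing_rows, "%d/%m/%Y")])
```

After this step, every record shares one date format and one casing convention, so the load stage can treat all sources identically.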

Loading – Moving to the warehouse

Loading is essentially moving the prepared data into the warehouse in question. While it is a basic process of moving data from one store to another, it is important to note that warehouses do not store information in real time. Keeping operational databases separate from the warehouse lets the former continue serving day-to-day operations while the warehouse doubles as a historical backup, avoiding unnecessary strain and damage.

Data warehouses typically have some critical characteristics:

  • Integrated. Data warehouses accumulate information from heterogeneous sources in one place.
  • Time-variant. The data is historical and is identified with a particular time period.
  • Non-volatile. Previous data is not removed when newer information is added.
  • Subject-oriented. The data is organized around topics (staff, support, sales, revenue, etc.) rather than being directly tied to ongoing operations.
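The time-variant and non-volatile characteristics can be made concrete with a minimal append-only load step. This is a sketch, not a real warehouse engine, and the account and revenue fields are illustrative – the point is that snapshots are dated and never overwritten:

```python
from datetime import date

warehouse = []  # append-only store: non-volatile by construction

def load(rows, snapshot_date):
    """Append a dated snapshot to the warehouse.

    Earlier snapshots are never removed or updated, which keeps the
    store non-volatile and makes every row time-variant (identified
    with the period it describes).
    """
    for row in rows:
        warehouse.append({**row, "snapshot": snapshot_date.isoformat()})

# Two monthly snapshots of the same account: both states coexist.
load([{"account": "acme", "mrr": 100}], date(2023, 1, 1))
load([{"account": "acme", "mrr": 120}], date(2023, 2, 1))

history = [r["mrr"] for r in warehouse if r["account"] == "acme"]
```

Because nothing is deleted, analysts can always ask what the business looked like in any earlier period – exactly what an operational database, which overwrites records in place, cannot offer.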


External data to maximize potential

Building a data warehouse is not the only way to get more value out of the same amount of information. Warehouses help with interdepartmental efficiency; data enrichment processes can help with intradepartmental efficiency.

Data enrichment from external sources

Data enrichment is the process of combining information from external sources with internal data. Sometimes enterprise-grade companies can enrich data purely from internal sources, provided they have enough distinct departments.

While warehouses work much the same for almost any business dealing with big data, each enrichment process will be different. This is because enrichment processes depend directly on business objectives. Otherwise, we would be back at the starting point, where data is collected without a proper end goal.

Inbound lead enrichment

A simple approach that could benefit many companies is inbound lead enrichment. Regardless of the industry, responding quickly to requests for more information increases sales efficiency. Enriching leads with business data (for example, public company information) makes it possible to categorize leads automatically and respond faster to those closest to the ideal customer profile (ICP).
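One way such automatic categorization might work is a simple weighted match against the ICP. The attributes, weights, and leads below are all hypothetical placeholders for whatever enrichment fields a company actually pulls in:

```python
# Hypothetical ICP: the fields and thresholds are illustrative.
ICP = {"industry": "saas", "region": "eu", "min_employees": 50}

def score_lead(lead):
    """Score an enriched lead by how closely it matches the ICP."""
    score = 0
    if lead.get("industry") == ICP["industry"]:
        score += 2  # industry fit weighted highest
    if lead.get("region") == ICP["region"]:
        score += 1
    if lead.get("employees", 0) >= ICP["min_employees"]:
        score += 1
    return score

# Leads enriched with (hypothetical) public company data.
leads = [
    {"email": "cto@fit.example", "industry": "saas", "region": "eu", "employees": 200},
    {"email": "info@far.example", "industry": "retail", "region": "us", "employees": 5},
]

# Respond first to the leads closest to the ideal customer profile.
queue = sorted(leads, key=score_lead, reverse=True)
```

Even a crude scorer like this lets a sales team triage inbound requests immediately instead of working through them in arrival order.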

Of course, data enrichment doesn’t have to be limited to sales departments. All kinds of processes can be powered by external data, from marketing campaigns to legal compliance. However, as always, the details must be considered. All data must have a business purpose.


Before tapping into complex external data sources, cleaning up internal processes will bring better results. With dark data comprising more than 90% of all data collected by companies, it is best to look inward first and optimize current processes. Adding more sources to inefficient data management practices would only bury potentially useful information further.

After building robust systems for data management, we can move on to complex data collection. Then we can be sure that we won’t miss anything important and we can match more data points to get valuable information.

Image credit: rfstudio / Pexels

Julius Cerniauskas

CEO at Oxylabs

Julius Cerniauskas is a Lithuanian tech industry leader and the CEO of Oxylabs, writing on web scraping, big data, machine learning, and technology trends.
