Data ingestion is a crucial step in data engineering. Data engineers load huge amounts of data into various database systems for further transformation and processing. With the relatively small amounts of data in a staging environment we rarely run out of memory, but production data pipelines with terabytes (or even petabytes) of records often turn into a real challenge. Existing ETL solutions automate data loading into the data warehouse of our choice, but they often come with row-based pricing models. In this story, I would like to discuss how to create a bespoke data-loading solution for our pipelines. We will take a closer look at common data ingestion design patterns and typical ways to organise the process. We will reverse-engineer some of the most popular ETL solutions to see how data can be ingested efficiently, without outages or data loss. To summarise my findings, I will provide data-loading examples using freely available Python libraries and tools.
On a scale from 1 to 10, how good are your data loading skills?
That is one of my favourite questions to ask during data engineering interviews. I keep looking for talented engineers who know how to build bespoke ETL systems.
Indeed, being able to create a robust data loading system that processes data efficiently, doesn’t fail, doesn’t consume too much memory, handles various data formats and scales well is what marks an experienced data engineer, in my opinion. With the abundance of ETL tools available in the market, we are in luck and rarely need to build one ourselves, until the company decides to bring data loading in-house. There might be various reasons for that, and one of the obvious ones is security and regulation. Dealing with sensitive data is always challenging, and often the data must not leave a certain region or geographical location. Another good reason to develop ETL expertise internally is that it can save a lot of money in the long run. Having an all-round software engineer who is experienced with data platform design and knows many ETL tools and frameworks is always great. Companies are hunting for this kind of talent. I…