
A Guide to Data Engineering Infrastructure

By 💡Mike Shakhomirov, Jan 2024


Automate resource provisioning with modern tools

Photo by Ehud Neuhaus on Unsplash

Modern data stacks consist of various tools and frameworks to process data. Typically this means a large collection of different cloud resources that transform data and bring it to a state where we can generate data insights. Managing this multitude of data processing resources is not a trivial task and might seem overwhelming. The good thing is that data engineers have a solution called infrastructure as code (IaC): essentially, code that helps us deploy, provision and manage all the resources we might ever need in our data pipelines.

In this story, I would like to discuss popular techniques and existing frameworks that aim to simplify resource provisioning and data pipeline deployments. I remember how, at the very beginning of my data career, I deployed data resources such as storage buckets and security roles using the web user interface. Those days are long gone, but I still remember the joy and happiness I felt when I learned that it could be done programmatically using templates and code.
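To make this concrete, here is a minimal infrastructure-as-code sketch. I am assuming Pulumi's Python SDK with the AWS provider purely for illustration; the article does not prescribe a tool, and the bucket name is hypothetical.

```python
# Minimal IaC sketch: Pulumi (Python) with the AWS provider.
# Assumed setup: pip install pulumi pulumi-aws, then run `pulumi up`.
import pulumi
from pulumi_aws import s3

# Declare a storage bucket for raw pipeline data. Pulumi compares this
# declaration with the live stack state and provisions only the difference.
raw_bucket = s3.Bucket(
    "raw-events",        # logical name; Pulumi appends a random suffix
    force_destroy=True,  # allow `pulumi destroy` to remove a non-empty bucket
)

# Export the physical bucket name so downstream pipelines can reference it.
pulumi.export("raw_bucket_name", raw_bucket.id)
```

Running `pulumi up` once creates the bucket; running it again changes nothing, and that repeatability is exactly what makes declarative provisioning safer than clicking through a console.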

Modern Data Stacks

So what exactly is a Modern Data Stack (MDS)? It is the set of technologies specifically used to organise, store, and manipulate data [1]. These tools are what shape a modern and successful data platform. I remember raising this discussion in one of my previous stories.

A simplified data platform blueprint often looks like this:

Simplified data platform blueprint. Image by author.

It usually contains dozens of different data sources and cloud platform resources to process them.
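To illustrate why code scales better than a console here, the hypothetical sketch below declares one landing bucket per data source in a loop (again Pulumi with the AWS provider; the source names are made up):

```python
import pulumi
from pulumi_aws import s3

# Hypothetical upstream sources; real platforms often have dozens of these.
sources = ["payments", "crm", "clickstream"]

for name in sources:
    # One landing bucket per source, declared uniformly in code instead of
    # being created one by one through a web UI.
    bucket = s3.Bucket(f"landing-{name}", force_destroy=True)
    pulumi.export(f"{name}_bucket_name", bucket.id)
```

Adding a new source becomes a one-line change that is reviewed and version-controlled like any other code.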

There are different data platform architecture types depending on business and functional requirements, the skill set of our users, and so on, but in general, infrastructure design follows several data processing…


