Image by storyset on Freepik
It’s a great time to break into data engineering. So where do you start?
Learning data engineering can sometimes feel overwhelming because of the number of tools that you need to know, not to mention the super intimidating job descriptions!
So if you are looking for a beginner-friendly introduction to data engineering, this free Data Engineering Course for Beginners, taught by Justin Chau, a developer advocate at Airbyte, is a good place to start.
In about three hours, you will learn essential data engineering skills: Docker, SQL, analytics engineering, and more. If you want to explore data engineering and see whether it is for you, this course is a great introduction. Now let’s go over what the course covers.
Link to the course: Data Engineering Course for Beginners
This course starts out with an intro on why you should consider becoming a data engineer in the first place, which I think is super helpful to understand before diving right into the technical topics.
The instructor, Justin Chau, talks about:
- The need for good-quality data and solid data infrastructure in ensuring the success of big data projects
- How data engineering roles are growing in demand and pay well
- The business value you can add as a data engineer by building and maintaining the organization’s data infrastructure
When you’re learning data engineering, Docker is one of the first tools you can add to your toolbox. Docker is a popular containerization tool that lets you package an application, along with its dependencies and configuration, into a single artifact called an image. Running an image gives you a container: a consistent, reproducible environment for your application, wherever it runs.
The Docker module of this course starts with the basics like:
- Dockerfiles
- Docker images
- Docker containers
The instructor then covers how to containerize an application with Docker, walking through the creation of a Dockerfile and the commands to get your container up and running. This section also covers persistent volumes, Docker networking fundamentals, and using Docker Compose to manage multiple containers.
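The course itself works with the Docker CLI and Docker Compose, but to keep the code samples in this review in a single language, here is a minimal Python sketch of the same moving parts using the Docker SDK (docker-py). The image tag, container, volume, and network names are placeholders, not taken from the course.

```python
# Minimal sketch of the Docker concepts in this module, via the Docker SDK for
# Python (docker-py) rather than the CLI the course uses. All names below are
# placeholders.
import docker

client = docker.from_env()  # connect to the local Docker daemon

# A user-defined bridge network and a named volume for persistent storage
client.networks.create("elt_network", driver="bridge")
client.volumes.create(name="pg_data")

# Run a Postgres container attached to that network, with the volume mounted
container = client.containers.run(
    "postgres:15",                        # image
    name="source_postgres",               # container name
    detach=True,
    environment={"POSTGRES_PASSWORD": "secret"},
    ports={"5432/tcp": 5433},             # host port 5433 -> container port 5432
    network="elt_network",
    volumes={"pg_data": {"bind": "/var/lib/postgresql/data", "mode": "rw"}},
)
print(container.name, container.short_id)
```

The CLI equivalents (docker network create, docker volume create, docker run) are what the module actually walks you through.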
Overall, this module is a good crash course on Docker if you’re new to containerization!
In the next module on SQL, you’ll learn how to run Postgres in a Docker container and then pick up the basics of SQL by creating a sample Postgres database and performing the following operations (a short sketch of a few of these follows the list):
- CRUD operations
- Aggregate functions
- Using aliases
- Joins
- Union and union all
- Subqueries
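The course runs these queries interactively against the sample database. As a minimal sketch, here is how a few of them might look when run from Python with psycopg2; the connection details and the films tables are hypothetical, made up purely for illustration.

```python
# Minimal sketch of a few of the SQL operations above, run from Python with
# psycopg2. The connection details and the "films" tables are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5433, dbname="source_db",
    user="postgres", password="secret",
)
cur = conn.cursor()

# CRUD: insert a row
cur.execute(
    "INSERT INTO films (title, release_year, rating) VALUES (%s, %s, %s)",
    ("Inception", 2010, 8.8),
)
conn.commit()

# Aggregate function with an alias
cur.execute("SELECT AVG(rating) AS avg_rating FROM films")
print(cur.fetchone())

# Join plus a subquery: films rated above the overall average
cur.execute(
    """
    SELECT f.title, c.category_name
    FROM films f
    JOIN film_category c ON c.film_id = f.film_id
    WHERE f.rating > (SELECT AVG(rating) FROM films)
    """
)
print(cur.fetchall())

cur.close()
conn.close()
```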
With Docker and SQL foundations, you can now learn to build a data pipeline from scratch. You’ll start by building a simple ELT pipeline that you’ll get to improve throughout the rest of the course.
Also, you’ll see how all the SQL, Docker networking, and Docker Compose concepts that you have learned thus far come together in this pipeline, which runs Postgres in Docker for both the source and the destination databases.
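If you’re wondering what the core of such an ELT script can look like, here is a minimal sketch of a dump-and-load step between two Postgres containers. The hostnames, credentials, and database names are placeholders, and the course’s actual script may differ in the details.

```python
# Minimal sketch of a dump-and-load ELT step between two Postgres containers,
# shelling out to pg_dump and psql. Hostnames, credentials, and database names
# are placeholders.
import os
import subprocess

env = dict(os.environ, PGPASSWORD="secret")

# Extract and load: dump the source database to a file...
subprocess.run(
    ["pg_dump", "-h", "source_postgres", "-U", "postgres",
     "-d", "source_db", "-f", "data_dump.sql"],
    env=env, check=True,
)

# ...and replay that dump into the destination database
subprocess.run(
    ["psql", "-h", "destination_postgres", "-U", "postgres",
     "-d", "destination_db", "-f", "data_dump.sql"],
    env=env, check=True,
)

print("ELT step finished; transformations come next.")
```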
The course then proceeds to the analytics engineering module, where you’ll learn how to use dbt (data build tool) to organize your SQL queries as custom data transformation models.
The instructor walks you through getting started with dbt: installing dbt-core and the required adapter and setting up the project. This module focuses on working with dbt models, macros, and Jinja (there’s a small sketch of kicking off a run after the list below). You’ll learn how to:
- Define custom dbt models and run them on top of the data in the destination database
- Organize SQL queries as dbt macros for reusability
- Use Jinja in dbt to add control structures to your SQL queries
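The models and macros themselves are SQL and Jinja files inside the dbt project. To keep the code samples here in Python, this is a small, hedged sketch of kicking off a run programmatically with dbtRunner (available in dbt-core 1.5+); the project directory, profiles directory, and model name are placeholders, and in the course you would typically just run dbt run from the command line.

```python
# Small sketch of running dbt models programmatically with dbtRunner
# (dbt-core 1.5+). The project/profiles directories and the model name are
# placeholders; `dbt run` from the CLI does the same thing.
from dbt.cli.main import dbtRunner, dbtRunnerResult

dbt = dbtRunner()

# Equivalent to: dbt run --select film_ratings --project-dir custom_postgres
res: dbtRunnerResult = dbt.invoke([
    "run",
    "--select", "film_ratings",
    "--project-dir", "custom_postgres",
    "--profiles-dir", ".",
])

if not res.success:
    raise RuntimeError(f"dbt run failed: {res.exception}")
```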
So far, you’ve built an ELT pipeline that runs only when you trigger it manually. You’ll certainly want some automation, and the simplest way to get it is to define a cron job that runs the pipeline at a specific time of the day.
So this super short section covers cron jobs. But data orchestration tools like Airflow (which you’ll learn about in the next module) give you much more granular control over the pipeline.
To orchestrate data pipelines, you can use open-source tools such as Airflow, Prefect, and Dagster. In this section, you’ll learn how to use Airflow.
This section is more extensive than the previous ones because it covers everything you need to know to get up to speed writing Airflow DAGs for the current project.
You’ll learn how to set up the Airflow webserver and the scheduler to schedule jobs. Then you’ll learn about Airflow operators, specifically the Python and Bash operators. Finally, you’ll define the tasks that go into the DAGs for the example at hand.
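To give you a feel for what those tasks can look like, here is a minimal, hedged DAG sketch that wires an ELT step to a dbt run using the Python and Bash operators. The task names, the schedule, and the elt_script module it imports are placeholders rather than the course’s exact code.

```python
# Minimal Airflow DAG sketch wiring an ELT step to a dbt run with the Python
# and Bash operators. Task names, the schedule, and the imported elt_script
# module are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

from elt_script import run_elt  # hypothetical module wrapping the ELT logic

with DAG(
    dag_id="elt_and_dbt",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # replaces the cron job from the previous section
    catchup=False,
) as dag:
    run_elt_task = PythonOperator(
        task_id="run_elt_script",
        python_callable=run_elt,
    )

    dbt_run_task = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt/custom_postgres",
    )

    # Run transformations only after the load finishes
    run_elt_task >> dbt_run_task
```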
In the last module, you’ll learn about Airbyte, an open-source data integration (data movement) platform that lets you connect to many more data sources and destinations with ease.
You’ll learn how to set up your environment and see how you can simplify the ELT process using Airbyte. To do so, you’ll modify the existing project’s components, namely the ELT script and the DAGs, to integrate Airbyte into the workflow.
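As a rough idea of what that integration can look like, the Airbyte provider for Airflow ships an operator that triggers a sync from within a DAG. Here is a hedged sketch; the Airflow connection name and the Airbyte connection UUID are placeholders.

```python
# Rough sketch of triggering an Airbyte sync from an Airflow DAG using the
# Airbyte provider's operator. The Airflow connection name and the Airbyte
# connection UUID are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

with DAG(
    dag_id="airbyte_elt",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    trigger_sync = AirbyteTriggerSyncOperator(
        task_id="airbyte_postgres_sync",
        airbyte_conn_id="airbyte_default",  # Airflow connection pointing at the Airbyte API
        connection_id="00000000-0000-0000-0000-000000000000",  # Airbyte connection UUID
        asynchronous=False,
        timeout=3600,
    )
```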
I hope you found this review of the free data engineering course helpful. I enjoyed the course, especially its hands-on approach of building and incrementally improving a data pipeline rather than focusing only on theory. The code is also available for you to follow along with. So, happy data engineering!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more.