Dive into the world of big data with this course on data engineering. Learn about technologies like Hadoop, Spark, and cloud-based solutions, and gain experience in building data pipelines, processing large datasets, and implementing ETL (Extract, Transform, Load) processes. The course prepares students to work with data at scale, providing the tools needed for data analysis, reporting, and business intelligence.
Set up the environment to learn SQL and Python essentials for Data Engineering.
Database essentials for Data Engineering using Postgres, such as creating tables and indexes, running SQL queries, and using important pre-defined functions.
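For illustration, here is a minimal sketch of these Postgres essentials driven from Python via psycopg2 (the connection details, table, and data are placeholders, not taken from the course):

```python
import psycopg2

# Connection details are placeholders; adjust for your own environment.
conn = psycopg2.connect(host="localhost", dbname="demo_db", user="demo_user", password="demo_pass")
cur = conn.cursor()

# Create a table and an index
cur.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id SERIAL PRIMARY KEY,
        order_date DATE,
        order_customer_id INT,
        order_status VARCHAR(30)
    )
""")
cur.execute("CREATE INDEX IF NOT EXISTS orders_status_idx ON orders (order_status)")
conn.commit()

# Run a query that uses pre-defined functions such as date_trunc and count
cur.execute("""
    SELECT date_trunc('month', order_date) AS order_month, count(*) AS order_count
    FROM orders
    GROUP BY 1
    ORDER BY 1
""")
print(cur.fetchall())

cur.close()
conn.close()
```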
Data Engineering programming essentials using Python, including basic programming constructs, collections, Pandas, and database programming.
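A short, illustrative sketch of those Python essentials, using made-up order records to show basic constructs, collections, and Pandas side by side:

```python
import pandas as pd

# A small collection of records (illustrative data only)
orders = [
    {"order_id": 1, "order_status": "COMPLETE", "order_amount": 199.99},
    {"order_id": 2, "order_status": "PENDING", "order_amount": 49.99},
    {"order_id": 3, "order_status": "COMPLETE", "order_amount": 299.99},
]

# Basic constructs and collections: filter with a list comprehension
complete_orders = [o for o in orders if o["order_status"] == "COMPLETE"]
print(len(complete_orders))

# The same data aggregated with Pandas
df = pd.DataFrame(orders)
revenue_by_status = df.groupby("order_status")["order_amount"].sum()
print(revenue_by_status)
```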
Data Engineering using Spark DataFrame APIs with Databricks. Learn important Spark DataFrame APIs such as select, filter, groupBy, and orderBy.
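A minimal PySpark sketch of those DataFrame APIs; the orders data is made up, and on Databricks a SparkSession named `spark` is already provided:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count

spark = SparkSession.builder.appName("dataframe-api-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "2024-01-05", "COMPLETE"), (2, "2024-01-06", "PENDING"), (3, "2024-01-06", "COMPLETE")],
    ["order_id", "order_date", "order_status"],
)

# select, filter, groupBy, and orderBy chained into one transformation
daily_complete = (
    orders
    .select("order_id", "order_date", "order_status")
    .filter(col("order_status") == "COMPLETE")
    .groupBy("order_date")
    .agg(count("*").alias("order_count"))
    .orderBy("order_date")
)
daily_complete.show()
```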
Data Engineering using Spark SQL. Learn how to write high-quality Spark SQL queries using SELECT, WHERE, GROUP BY, ORDER BY, and more.
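The same kind of logic expressed with Spark SQL over a temporary view; again, the data and view name are purely illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "2024-01-05", "COMPLETE"), (2, "2024-01-06", "PENDING"), (3, "2024-01-06", "COMPLETE")],
    ["order_id", "order_date", "order_status"],
)
orders.createOrReplaceTempView("orders")

# SELECT, WHERE, GROUP BY, and ORDER BY in a single Spark SQL query
spark.sql("""
    SELECT order_date, count(*) AS order_count
    FROM orders
    WHERE order_status = 'COMPLETE'
    GROUP BY order_date
    ORDER BY order_date
""").show()
```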
Understand the role of the Spark Metastore and how DataFrames integrate with Spark SQL.
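A brief sketch of that integration, assuming a metastore is available (for example on Databricks or a cluster configured with Hive support); the database and table names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metastore-demo").enableHiveSupport().getOrCreate()

orders = spark.createDataFrame(
    [(1, "COMPLETE"), (2, "PENDING")],
    ["order_id", "order_status"],
)

# Persist a DataFrame as a managed table registered in the Metastore
spark.sql("CREATE DATABASE IF NOT EXISTS retail_demo")
orders.write.mode("overwrite").saveAsTable("retail_demo.orders")

# The same table is now reachable from both the DataFrame API and Spark SQL
spark.table("retail_demo.orders").show()
spark.sql("SELECT order_status, count(*) AS cnt FROM retail_demo.orders GROUP BY order_status").show()
```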
Build Data Engineering Pipelines using Spark, leveraging Python as the programming language.
Work with different file formats such as Parquet, JSON, and CSV in building Data Engineering Pipelines.
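To make the pipeline and file-format topics concrete, here is a minimal extract-transform-load sketch in PySpark; the input and output paths and the column names are assumptions for illustration only:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("file-formats-demo").getOrCreate()

# Extract: read raw CSV data (placeholder path)
orders = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/data/raw/orders.csv")
)

# Transform: keep only completed orders
complete_orders = orders.filter(col("order_status") == "COMPLETE")

# Load: write Parquet for analytics and JSON for downstream consumers
complete_orders.write.mode("overwrite").parquet("/data/curated/orders_parquet")
complete_orders.write.mode("overwrite").json("/data/curated/orders_json")
```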
Set up a Hadoop and Spark cluster on GCP using Dataproc.
Understand the complete Spark application development lifecycle: build Spark applications using PySpark and review them using the Spark UI.
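As a rough sketch of what such a standalone application might look like (file name, paths, and app name are hypothetical), a script like this could be submitted with spark-submit and then inspected through the Spark UI, which by default is exposed on port 4040 of the driver while the job runs:

```python
# app.py - illustrative standalone PySpark application
import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

def main(input_path, output_path):
    spark = SparkSession.builder.appName("orders-daily-summary").getOrCreate()
    orders = spark.read.parquet(input_path)
    summary = orders.groupBy("order_date").agg(count("*").alias("order_count"))
    summary.write.mode("overwrite").parquet(output_path)
    spark.stop()

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```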