Welcome to the Data Engineering Nanodegree Program
Prerequisites
How to Succeed 3:32
Access the Career Portal
How Do I Find Time for My Nanodegree?
Introduction to Data Engineering
What do Data Engineers do? 1:44
What do Data Engineers do? 2 6:02
A Brief History
Data Engineering Tools
Introduction to Data Modeling
02. What is Data Modeling? 4:20
Test
03. Why is Data Modeling Important? 1:44
Test
04. Who does this type of work? 0:28
05. Intro to Relational Databases 3:43
Test
06. Relational Databases
07. When to use a relational database? 3:17
08. ACID Transactions 3:46
08. Test
09. When Not to Use a Relational Database 2:31
09. Test
10. What is PostgreSQL? 0:53
11. Demo 1: Creating a Postgres Table 10:57
12. Exercise 1: Creating a Table with Postgres
13. Solution for Exercise 1: Create a Table with Postgres 4:52
14. NoSQL Databases 4:30
14. Test
15. What is Apache Cassandra? 0:31
15. Test
16. When to Use a NoSQL Database 3:03
17. When Not to Use a NoSQL Database 2:33
18. Demo 2: Creating a Table with Cassandra 7:37
19. Exercise 2: Create a Table with Cassandra
20. Solution for Exercise 2: Create a Table with Cassandra 4:55
21. Conclusion 0:31
Relational Data Models
02. Databases 0:41
03. Importance of Relational Databases 2:15
04. OLAP vs OLTP 1:13
05. Quiz 1
06. Structuring the Database: Normalization 2:58
07. Objectives of Normal Form 1:54
08. Quiz
08. Normal Forms 6:48
09. Demo 1: Creating Normalized Tables 11:52
10. Exercise 1: Creating Normalized Tables
12. Denormalization 3:56
12. Test
13. Demo 2: Creating Denormalized Tables
14. Denormalization vs. Normalization
15. Exercise 2: Creating Denormalized Tables
17. Fact and Dimension Tables 2:46
18. Star Schema 0:37
19. Benefits of Star Schemas
20. Snowflake Schemas 0:56
21. Demo 3: Creating Fact and Dimension Tables
22. Exercise 3: Creating Fact and Dimension Tables
24. Data Definition and Constraints
25. Upsert
26. Conclusion 1:06
Project Data Modeling with Postgres
Project Datasets
Project Instructions
Project Description – Data Modeling with Postgres
Project Rubric – Data Modeling with Postgres
NoSQL Data Models
02. Non-Relational Databases 2:50
03. Distributed Databases 2:33
04. CAP Theorem 1:27
05. Quiz 1
06. Denormalization in Apache Cassandra 5:14
06. Quiz
07. CQL 1:01
08. Demo 1 5:40
09. Exercise 1
10. Exercise 1 Solution
11. Primary Key 3:33
12. Primary Key
13. Demo 2 4:24
14. Exercise 2
15. Exercise 2: Solution
16. Clustering Columns
16. Quiz
17. Demo 3
18. Exercise 3
19. Exercise 3: Solution
20. WHERE Clause
20. Quiz
21. Demo 4 5:08
22. Exercise 4
22.1 Solution
23. Lesson Wrap Up
Course Wrap Up
Project Data Modeling with Apache Cassandra
Introduction
Project Details
Project Workspace
Project Description – Data Modeling with Apache Cassandra
Project Rubric – Data Modeling with Apache Cassandra
Introduction to Data Warehouses
01. Course Introduction 0:40
02. Lesson Introduction 1:06
03. Data Warehouse: Business Perspective 3:30
04. Operational vs. Analytical Processes 3:10
04. Quiz
05. Data Warehouse: Technical Perspective 4:20
06. Dimensional Modeling 3:28
06. Quiz
07. ETL Demo: Step 1 & 2 3:45
08. Exercise 1: Step 1 & 2
09. ETL Demo: Step 3 4:16
10. Exercise 1: Step 3
11. ETL Demo: Step 4 1:49
12. Exercise 1: Step 4
13. ETL Demo: Step 5 6:05
14. Exercise 1: Step 5
15. ETL Demo: Step 6
16. Exercise 1: Step 6
17. DWH Architecture: Kimball’s Bus Architecture
17. Quiz
18. DWH Architecture: Independent Data Marts
18. Quiz
19. DWH Architecture: CIF
19. Quiz
20. DWH Architecture: Hybrid Bus & CIF
20. Quiz
21. OLAP Cubes
22. OLAP Cubes: Roll-Up and Drill Down
23. OLAP Cubes: Slice and Dice
23. Quiz
24. OLAP Cubes: Query Optimization
25. OLAP Cubes Demo: Slicing & Dicing
26. Exercise 2: Slicing & Dicing
27. OLAP Cubes Demo: Roll-Up
28. Exercise 2: Roll-Up & Drill Down
29. OLAP Cubes Demo: Grouping Sets
30. Exercise 2: Grouping Sets
31. OLAP Cubes Demo: CUBE
32. Exercise 2: CUBE
33. Data Warehouse Technologies
34. Demo: Column Format in ROLAP
35. Exercise 3: Column Format in ROLAP
Introduction to Cloud Computing and AWS
01. Lesson Introduction
02. Cloud Computing
03. Amazon Web Services 1:02
04. AWS Setup Instructions
05. Create an IAM Role
06. Create Security Group
07. Launch a Redshift Cluster
08. Create an IAM User
09. Delete a Redshift Cluster
10. Create an S3 Bucket
11. Upload to S3 Bucket
12. Create PostgreSQL RDS
13. Avoid Paying Unexpected Costs for AWS
Implementing Data Warehouses on AWS
01. Lesson Introduction 1:32
02. Data Warehouse: A Closer Look
03. Choices for Implementing a Data Warehouse
03. Quiz
04. DWH Dimensional Model Storage on AWS
05. Amazon Redshift Technology
05. Quiz
06. Amazon Redshift Architecture
06. Quiz
07. Redshift Architecture Example
08. SQL to SQL ETL
08. Quiz
09. SQL to SQL ETL – AWS Case
09. Quiz
10. Redshift & ETL in Context
10. Quiz
11. Ingesting at Scale
11. Quiz
12. Redshift ETL Examples 2:19
12. Quiz
13. Redshift ETL Continued 3:00
13. Quiz
14. Redshift Cluster Quick Launcher 1:37
15. Exercise 1: Launch Redshift Cluster
16. Problems with the Quick Launcher 2:20
17. Infrastructure as Code on AWS 2:59
17. Quiz
18. Enabling Programmatic Access for IaC 2:19
19. Demo: Infrastructure as Code 2:27
20. Exercise 2: Infrastructure as Code
21. Exercise Solution 2: Infrastructure as Code
22. Demo: Parallel ETL
23. Exercise 3: Parallel ETL
24. Exercise Solution 3: Parallel ETL
25. Optimizing Table Design
26. Distribution Style: Even
26. Quiz
27. Distribution Style: All
27. Quiz
28. Distribution Style: Auto
29. Distribution Style: Key
29. Quiz
30. Sorting Key
31. Sorting Key Example
32. Demo: Table Design
33. Exercise 4: Table Design
34. Exercise Solution 4: Table Design
35. Conclusion
Project: Data Warehouse
Introduction
Project Details
Project Instructions
Environment
Project Description – Data Warehouse
Project Rubric – Data Warehouse
The Power of Spark
01. Introduction 1:25
02. What is Big Data? 1:24
02. Quiz
03. Numbers Everyone Should Know
03. Quiz
04. Hardware: CPU
04. Quiz
05. Hardware: Memory
Hardware: Memory 2
05. Quiz
06. Hardware: Storage
07. Hardware: Network
07. Quiz
08. Hardware: Key Ratios
09. Small Data Numbers
10. Big Data Numbers
10. Quiz
Big Data Numbers Part 2
11. Medium Data Numbers
12. History of Distributed Computing
12. Quiz
13. The Hadoop Ecosystem
14. MapReduce 3:00
14. Quiz
15. Hadoop MapReduce [Demo]
16. The Spark Cluster 2:22
16. Quiz
17. Spark Use Cases 1:30
18. Summary
Data Wrangling with Spark
01. Introduction 1:12
02. Functional Programming 1:40
03. Why Use Functional Programming 1:19
03. Quiz
04. Procedural Example 1:24
05. Procedural [Example Code]
06. Pure Functions in the Bread Factory 2:31
07. The Spark DAGs: Recipe for Data 2:14
08. Maps and Lambda Functions 3:37
09. Maps and Lambda Functions [Example Code]
10. Data Formats 2:22
11. Distributed Data Stores 1:15
12. SparkSession 1:17
13. Reading and Writing Data into Spark DataFrames 3:57
15. Imperative vs Declarative Programming
16. Data Wrangling with DataFrames
17. Data Wrangling with DataFrames Extra Tips
18. Data Wrangling with Spark [Example Code]
19. Quiz – Data Wrangling with DataFrames
19. Quiz
20. Quiz – Data Wrangling with DataFrames Jupyter Notebook
21. Quiz [Solution Code]
22. Spark SQL 0:56
23. Example Spark SQL 2:12
24. Example Spark SQL [Example Code]
25. Quiz – Data Wrangling with SparkSQL
26. Quiz [Spark SQL Solution Code]
27. RDDs
28. Summary
Debugging and Optimization
01. Introduction
02. Setup Instructions AWS
03. From Local to Standalone Mode
04. Spark Scripts
05. Submitting Spark Scripts 10:53
06. Storing and Retrieving Data on the Cloud 0:56
07. Reading and Writing to Amazon S3 Part 1 1:52
07. Reading and Writing to Amazon S3 Part 2
07. Reading and Writing to Amazon S3 Part 3
08. Introduction to HDFS
09. Reading and Writing Data to HDFS
10. Recap Local Mode to Cluster Mode
11. Debugging is Hard
12. Syntax Errors
13. Code Errors
14. Data Errors
15. Data Errors
16. Debugging your Code
17. How to Use Accumulators
18. Spark Web UI
19. Connecting to the Spark Web UI
20. Getting Familiar with the Spark UI
21. Review of the Log Data
22. Diagnosing Errors Part 1
23. Diagnosing Errors Part 2
24. Diagnosing Errors Part 3
25. Optimization Introduction
26. Understanding Data Skew
27. Understanding Big O Complexity
28. Other Issues and How to Address Them
29. Lesson Summary
Introduction to Data Lakes
01. Introduction
02. Lesson Overview
03. Why Data Lakes: Evolution of the Data Warehouse
04. Why Data Lakes: Unstructured & Big Data
05. Why Data Lakes: New Roles & Advanced Analytics
06. Big Data Effects: Low Costs, ETL Offloading 1:36
07. Big Data Effects: Schema-on-Read 3:23
08. Big Data Effects: (Un-/Semi-)Structured Support 2:41
09. Demo: Schema on Read Pt 1 2:43
10. Demo: Schema on Read Pt 2 2:54
11. Demo: Schema on Read Pt 3
12. Demo: Schema on Read Pt 4
13. Exercise 1: Schema on Read
14. Demo: Advanced Analytics NLP Pt 1
15. Demo: Advanced Analytics NLP Pt 2 2:25
16. Demo: Advanced Analytics NLP Pt 3
17. Exercise 2: Advanced Analytics NLP
18. Data Lake Implementation Introduction
19. Data Lake Concepts
20. Data Lake vs Data Warehouse
21. AWS Setup
22. Data Lake Options on AWS
23. AWS Options: EMR (HDFS + Spark) 1:50
24. AWS Options: EMR (S3 + Spark) 3:17
25. AWS Options: Athena 2:13
26. Demo: Data Lake on S3 Pt 1 3:23
27. Demo: Data Lake on S3 Pt 2
28. Exercise 3: Data Lake on S3
29. Demo: Data Lake on EMR Pt 1
30. Demo: Data Lake on EMR Pt 2
31. Demo: Data Lake on Athena Pt 1 3:53
32. Demo: Data Lake on Athena Pt 2 2:49
33. Data Lake Issues
34. [AWS] Launch EMR Cluster and Notebook
35. [AWS] Avoid Paying Unexpected Costs
Project: Data Lake
Project Introduction
Project Datasets
Project Instructions
Project Description – Data Lake
Project Rubric – Data Lake
Data Pipeline
01. Welcome 1:15
03. What is a Data Pipeline? 2:14
03. Quiz
03.1 Install Apache Airflow on Windows using Windows Subsystem for Linux (WSL) 15:20
03.2 Install Apache Airflow on macOS 7:32
04. Data Validation 2:00
04. Quiz
05. DAGs and Data Pipelines 3:25
05. Quiz
06. Bikeshare DAG 1:23
06. Quiz
07. Introduction to Apache Airflow 2:11
08. Demo 1: Airflow DAGs 8:23
09. Workspace Instructions
10. Exercise 1: Airflow DAGs
11. Solution 1: Airflow DAGs 1:26
12. How Airflow Works
13. Airflow Runtime Architecture
13. Quiz
14. Building a Data Pipeline
15. Demo 2: Run the Schedules
16. Exercise 2: Run the Schedules
17. Solution 2: Run the Schedules
18. Operators and Tasks 2:48
19. Demo 3: Task Dependencies
20. Exercise 3: Task Dependencies
21. Solution 3: Task Dependencies
22. Airflow Hooks
23. Demo 4: Connections and Hooks
24. Exercise 4: Connections and Hooks
25. Solution 4: Connections and Hooks
26. Demo 5: Context and Templating
27. Exercise 5: Context and Templating
28. Solution 5: Context and Templating
29. Quiz: Review of Pipeline Components
30. Demo: Exercise 6: Building the S3 to Redshift DAG 7:07
31. Exercise 6: Build the S3 to Redshift DAG
32. Solution 6: Build the S3 to Redshift DAG
33. Conclusion
Data Quality
01. What Are We Going to Learn? 0:36
02. What is Data Lineage?
03. Visualizing Data Lineage 2:15
03. Quiz
04. Demo 1: Data Lineage in Airflow 5:11
05. Exercise 1: Data Lineage in Airflow
06. Solution 1: Data Lineage in Airflow
07. Data Pipeline Schedules
08. Scheduling in Airflow
08. Quiz
09. Updating DAGs
09. Updating DAGs 2
10. Demo 2: Schedules and Backfills in Airflow
11. Exercise 2: Schedules and Backfills in Airflow
12. Solution 2: Schedules and Backfills in Airflow
13. Data Partitioning
14. Goals of Data Partitioning
14. Quiz
15. Demo 3: Data Partitioning
16. Exercise 3: Data Partitioning
17. Solution 3: Data Partitioning
18. Data Quality
18. Quiz
19. Demo 4: Data Quality
20. Exercise 4: Data Quality
21. Solution 4: Data Quality
22. Conclusion
Production Data Pipelines
01. Lesson Introduction
02. Extending Airflow with Plugins
03. Extending Airflow: Hooks & Contrib
04. Demo 1: Operator Plugins
05. Exercise 1: Operator Plugins
06. Solution 1: Operator Plugins
07. Best Practices for Data Pipeline Steps – Task Boundaries
08. Demo 2: Task Boundaries 7:25
09. Exercise 2: Refactor a DAG
10. Solution 2: Refactor a DAG
11. SubDAGs: Introduction and When to Use Them
12. SubDAGs: Drawbacks of SubDAGs
13. Quiz: SubDAGs
14. Demo 3: SubDAGs
15. Exercise 3: SubDAGs
16. Solution 3: SubDAGs
17. Monitoring
18. Monitoring
18. Quiz
19. Exercise 4: Building a Full DAG
20. Solution 4: Building a Full Pipeline
21. Conclusion
22. Additional Resources: Data Pipeline Orchestrators
Project Data Pipelines
Project Introduction
Project Overview
Add Airflow Connections to AWS
Project Instructions
Workspace Instructions
Project Workspace
Project Description – Data Pipelines
Project Rubric – Data Pipelines
Take 30 Min to Improve your LinkedIn
Get Opportunities with LinkedIn 2:01
Use Your Story to Stand Out 3:00
Why Use an Elevator Pitch
Create Your Elevator Pitch
Pitching to a Recruiter
Use Your Elevator Pitch on LinkedIn
06. Create Your Profile With SEO In Mind
07. Profile Essentials
08. Work Experiences & Accomplishments
09. Build and Strengthen Your Network
10. Reaching Out on LinkedIn
11. Boost Your Visibility
12. Up Next
Project Description – Improve Your LinkedIn Profile
Project Rubric – Improve Your LinkedIn Profile
Capstone Project
Project Instructions
Project Resources
Project Description – Data Engineering Capstone Project
Project Rubric – Data Engineering Capstone Project
Job Search
Intro
Job Search Mindset
Target Your Application to an Employer
Open Yourself Up to Opportunity 0:24
Refine Your Entry-Level Resume
Convey Your Skills Concisely 1:23
Effective Resume Components 1:36
Resume Structure 2:12
Describe Your Work Experiences 1:09
Resume Reflection
Craft Your Cover Letter
Get an Interview with a Cover Letter! 1:39
Purpose of the Cover Letter 1:10
Cover Letter Components 0:54
Write the Introduction 1:34
Write the Body
Write the Conclusion
Format
Optimize Your GitHub Profile
Introduction
GitHub Profile: Important Items
Good GitHub Repository
Interview Part 1
Identify Fixes for Example “Bad” Profile
Identify Fixes for Example “Bad” Profile 2
Quick Fixes #1
Quick Fixes #2
Writing READMEs
Interview Part 2
Commit Message Best Practices
Reflect on Your Commit Messages
Participating in Open Source Projects
Starring Interesting Repositories
Develop Your Personal Brand
Why Network?
Why Use Elevator Pitches?
Personal Branding
Elevator Pitch
Pitching to a Recruiter
Use Your Elevator Pitch
Project: Data Modeling with Postgres
Introduction
A startup called Sparkify wants to analyze the data they’ve been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don’t have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
They’d like a data engineer to create a Postgres database with tables designed to optimize queries on song play analysis, and they’ve brought you onto the project. Your role is to create a database schema and an ETL pipeline for this analysis. You’ll be able to test your database and ETL pipeline by running queries given to you by the analytics team at Sparkify and comparing your results with their expected results.
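As a sketch of the kind of transformation the ETL pipeline performs, the snippet below turns one JSON log line into a songplay row. The field names (`page`, `ts`, `userId`, and so on) are assumptions based on a typical event log, not the definitive Sparkify schema, and the filter on `page == "NextSong"` is the usual convention for keeping only listening events:

```python
import json
from datetime import datetime, timezone

def log_record_to_songplay_row(line):
    """Convert one JSON log line into a songplay row tuple.

    Returns None for non-listening events (e.g. page loads),
    mirroring the usual filter on page == 'NextSong'.
    """
    record = json.loads(line)
    if record.get("page") != "NextSong":
        return None
    # Timestamps in the logs are assumed to be epoch milliseconds.
    started_at = datetime.fromtimestamp(record["ts"] / 1000.0, tz=timezone.utc)
    return (
        started_at,
        record.get("userId"),
        record.get("level"),
        record.get("sessionId"),
        record.get("location"),
        record.get("userAgent"),
    )

sample = ('{"page": "NextSong", "ts": 1541106106796, "userId": "8", '
          '"level": "free", "sessionId": 139, "location": "Phoenix", '
          '"userAgent": "Mozilla"}')
row = log_record_to_songplay_row(sample)
```

In the real pipeline a row like this would be passed to an `INSERT` statement against the songplays table; keeping the transformation as a pure function makes it easy to unit-test against the analytics team's expected results.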
Project Description
In this project, you’ll apply what you’ve learned about data modeling with Postgres and build an ETL pipeline using Python. You will define fact and dimension tables for a star schema for a particular analytic focus, and write an ETL pipeline that transfers data from files in two local directories into these tables in Postgres using Python and SQL.
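To make the star-schema idea concrete, here is a minimal sketch: one fact table (`songplays`) referencing two dimension tables (`users` and `songs`). The table and column names are illustrative guesses at a song-play schema, and `sqlite3` stands in for Postgres so the example is self-contained; in the project itself you would issue the same kind of DDL through a Postgres driver such as psycopg2:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Fact table (songplays) with foreign keys into the dimension tables.
cur.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, first_name TEXT, level TEXT);
CREATE TABLE songs (song_id TEXT PRIMARY KEY, title TEXT, duration REAL);
CREATE TABLE songplays (
    songplay_id INTEGER PRIMARY KEY,
    start_time  TEXT,
    user_id     INTEGER REFERENCES users (user_id),
    song_id     TEXT REFERENCES songs (song_id)
);
""")

cur.execute("INSERT INTO users VALUES (8, 'Kaylee', 'free')")
cur.execute("INSERT INTO songs VALUES ('SOABC12', 'Some Song', 231.4)")
cur.execute("INSERT INTO songplays VALUES (1, '2018-11-01 21:01:46', 8, 'SOABC12')")

# The analytic question the star schema optimizes: what songs are users playing?
cur.execute("""
    SELECT u.first_name, s.title
    FROM songplays sp
    JOIN users u ON sp.user_id = u.user_id
    JOIN songs s ON sp.song_id = s.song_id
""")
rows = cur.fetchall()
```

The design point is that every analytic query fans out from the fact table through one join per dimension, which keeps song-play analysis queries simple and fast.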