Thursday, January 16, 2020

A Cost-Effective Modern Data Analytics Architecture: Go from a Data Lake file to PowerBI/Tableau using Databricks/Spark



This article describes one of the simplest and most cost-efficient architectures that will satisfy business needs without data duplication.

Business Case:
A common scenario in modern data warehousing/analytics is that certain business units want to leverage data sitting on the data lake for their visualizations and analytics. They want to avoid pulling the data from the data lake into a local on-premises SQL database instance. They have trouble finding ways to connect their visualization tools directly to data stored in the data lake. They also want a cloud-first approach that aligns with the company's initiatives and goals. They want it cheap and flexible.

Why is this cheap? You only pay for Databricks cluster uptime and data lake file storage. It eliminates the need for any other tool such as SQL Server, Azure SQL Data Warehouse, etc.

Example (Azure environment)
I have sales reports that were sent by a 3rd party to a file share. I want to process this data, combine it with some reference data from my product catalog system, and visualize it in PowerBI or Tableau. So far, I am able to ingest the data with Data Factory, transform it with Databricks, and store the transformed data in the data lake (Azure Data Lake Store Gen2). Now I want to create visualizations from this data with PowerBI or Tableau.

Challenge: PowerBI and Tableau do not have connectors for Azure Data Lake Store that handle file types like Parquet or ORC; PowerBI can only connect to CSV files directly. This becomes a challenging scenario for a data analyst.
Old Solution: Push the data from the Data Lake Store to SQL Server, or use a different tool like Dremio to connect. These are not terrible solutions; however, they have disadvantages.


SOLUTION

Below is the high-level architecture




The above architecture shows how you can connect PowerBI directly to your data lake by using the Spark connectors from Databricks.

In order to accomplish this, you have to create a Spark database and tables in your Databricks cluster using the concept of an external table.


For illustration, let's assume I have transformed my data and loaded it in Parquet file format to the data lake (ADLS) using the Spark DataFrame writer API.
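For reference, the write step I am assuming looks roughly like the sketch below. This is a minimal illustration, not my exact pipeline code; the source path, column name, and transformation are placeholders.

%python
# Minimal sketch of the "transform and load to parquet" step, assuming a
# Databricks notebook where ADLS is already mounted under /mnt.
cleaned_taxes_df = (
    spark.read.option("header", "true").csv("/mnt/raw/taxes")   # hypothetical raw input
         .filter("trip_distance > 0")                           # example transformation
)
cleaned_taxes_df.write.mode("overwrite").parquet("/mnt/delta/cleaned_taxes")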

Now I can use the Spark SQL query below to create an external table over this Parquet file. The table's schema and metadata will be persisted in my cluster's default database for as long as the cluster is ON.

Query:
%sql
CREATE TABLE cleaned_taxes
  USING parquet
  LOCATION '/mnt/delta/cleaned_taxes'

In the above query sample, I am creating an external table over my 'cleaned_taxes' dataset. Please refer to this article for more details: https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-table.html



Please note that this external table concept will work for any dataset format that is supported by the Spark writer API (csv, orc, json, txt). Refer to this article for more details: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-DataFrameWriter.html
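For example, the same pattern works for a CSV dataset. The sketch below is illustrative only; the table name, options, and path are my own assumptions, not part of the original example.

%python
# Hypothetical example: an external table over a CSV dataset instead of Parquet.
spark.sql("""
  CREATE TABLE IF NOT EXISTS raw_sales_csv
  USING csv
  OPTIONS (header 'true', inferSchema 'true')
  LOCATION '/mnt/delta/raw_sales'
""")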

You can confirm that the Spark table was created in your cluster under the database section of the workspace.
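If you prefer to verify from a notebook cell instead, a quick sanity check like this sketch (assuming the table landed in the default database) also works.

%python
# List the tables in the current (default) database and preview the new one.
spark.sql("SHOW TABLES").show()
spark.sql("SELECT * FROM cleaned_taxes LIMIT 10").show()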


Once this is done, you have readied your dataset (cleaned_taxes) for connection to a visualization tool through the Spark connector.

This next step, connecting your PowerBI or Tableau to Databricks, is well documented in various blogs.

For Tableau to Databricks see this link https://docs.databricks.com/bi/tableau.html

PowerBI to Databricks see this link  https://docs.microsoft.com/en-us/azure/databricks/bi/power-bi
or
 https://towardsdatascience.com/visualizing-data-with-azure-databricks-and-power-bi-desktop-845b0e317dc6

Things to Note
1. Your Databricks cluster has to be ON throughout this process to ensure connectivity from Tableau or PowerBI.
2. Both DirectQuery and Import modes are supported. I always prefer importing the data as long as it fits into the reporting server.
3. For scheduled dashboard or report data refreshes, the Databricks cluster needs to be ON. Accomplish this by starting the cluster about 10 minutes before the scheduled refresh time and giving it the time a refresh requires (20 minutes to 1 hour). You can synchronize this by setting up a Databricks job on your cluster that runs before your reports refresh the data. The job could be a simple notebook that just waits for a while (say 30 minutes) and does nothing.




For additional information, see this article: https://docs.databricks.com/clusters/clusters-manage.html

Example:

Create a notebook with the following Python code that just waits for 60 seconds:
import time
time.sleep(60)   # Delays for 60 seconds. You can also use a float value.

Now create a job to fire this notebook about 15 minutes before your scheduled report refresh time (a Databricks cluster normally takes 5-10 minutes to start). This ensures the Databricks cluster is warmed up and ready for JDBC calls from the dashboard/report server.
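If you prefer to create this job programmatically instead of through the UI, a call along the lines of the sketch below against the Databricks Jobs REST API should do it. This is a hedged sketch, not exact code: the workspace URL, token, cluster id, notebook path, and cron expression are placeholders you would replace with your own values.

# Sketch: create a scheduled "warm-up" job that runs the wait notebook on the
# existing interactive cluster shortly before the report refresh time.
import requests

resp = requests.post(
    "https://<your-workspace>.azuredatabricks.net/api/2.0/jobs/create",
    headers={"Authorization": "Bearer <your-personal-access-token>"},
    json={
        "name": "warm-up-before-report-refresh",
        "existing_cluster_id": "<your-cluster-id>",
        "notebook_task": {"notebook_path": "/Shared/warm_up_notebook"},
        # Quartz cron: 5:45 AM daily, 15 minutes ahead of a 6:00 AM report refresh
        "schedule": {
            "quartz_cron_expression": "0 45 5 * * ?",
            "timezone_id": "America/New_York",
        },
    },
)
print(resp.json())   # returns the new job_id on success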

For more information on creating jobs in databricks see link https://docs.databricks.com/jobs.html#job-create


Disclaimer: This is just a conceptual illustration, not a demo with exact code. Please formulate your own logic and code

How to grow into a Data Engineer from an ETL, DBA, or Analyst background at no cost




This post highlights the core skills anyone interested in becoming a data engineer needs to develop. I have added some reference material that I actually used in my own studies.

The skills below are ranked in order of priority:

1. Advanced SQL query knowledge
     https://youtu.be/HXV3zeQKqGY
    https://youtu.be/2Fn0WAyZV0E
2. Intermediate to Advanced understanding of Relational Databases
     https://youtu.be/ztHopE5Wnpc
3. Intermediate understanding of Data Modelling (Star Schema, Snowflake schema)
     https://youtu.be/tR_rOJPiEXc
     https://youtu.be/lWPiSZf7-uQ
4. Extract Transform Load basics
     https://youtu.be/7MOU1l30lXs
5. Data Warehousing
     https://youtu.be/lWPiSZf7-uQ
     https://youtu.be/CHYPF7jxlik

Most ETL developers, DBAs, or Business Analysts like me should already have the skills in the list above.

Additional Core Skills for Data Engineering (I had to learn these)
1. Deep understanding of the fundamentals of Big Data and distributed systems.
     https://youtu.be/tpspO9K28PM
     https://youtu.be/Y6Ev8GIlbxc
2. Apache Spark architecture and programming in Spark
     https://youtu.be/CF5Ewk0GxiQ
     https://youtu.be/GFC2gOL1p9k
     https://youtu.be/dq73Ghk3MQg
     Note: the above videos might not be comprehensive; feel free to go deeper. Also note that the RDD API is no longer in common use; focus instead on Spark SQL and the PySpark or Scala DataFrame APIs (a short comparison sketch appears after this list).
3. Python programming intermediate level
    https://youtu.be/rfscVS0vtbw
    https://youtu.be/mkv5mxYu0Wk   (Python for Data Science)
    https://youtu.be/vmEHCJofslg (learn Pandas library)
    https://www.youtube.com/watch?v=K8L6KVGG-7o
4. Cloud computing fundamentals (Azure, AWS)
     https://www.youtube.com/playlist?list=PL-V4YVm6AmwWLTTwZdI7hcpKqTpFUIKUE (Azure)
5. Hadoop Distributed Files System Architecture
     https://youtu.be/pY0Wgbe712o
   
6. Big Data File Formats
     https://youtu.be/UXhyENkYokw
     https://youtu.be/rVC9F1y38oU
7. Hive
    https://youtu.be/AcpGl0TQIRM 
8. Optimization Techniques for all the above systems or topics
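
As a quick illustration of the note under point 2 above, here is a short comparison of the legacy RDD style versus the DataFrame/Spark SQL style (my own toy example, not taken from the linked videos).

# Summing amounts per category two ways: the legacy RDD API vs. the DataFrame API.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
data = [("electronics", 250.0), ("grocery", 40.0), ("electronics", 125.0)]

# Legacy RDD API: manual key/value pairs and reduceByKey
rdd_totals = (spark.sparkContext.parallelize(data)
                   .reduceByKey(lambda a, b: a + b)
                   .collect())
print(rdd_totals)

# Preferred DataFrame API: declarative and optimized by the Catalyst engine
df = spark.createDataFrame(data, ["category", "amount"])
df.groupBy("category").agg(F.sum("amount").alias("total")).show()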

Additional Skills that are also needed but not a priority
1. Kafka and streaming tools like (Spark Streaming)
2. NoSQL Databases
3. Continuous Integration and Continuous Delivery (CI/CD) practices
4. Data Lake basics
5. Cloud ETL tools
6. Graph Databases
7. Machine Learning
8. Microservices
     https://youtu.be/j1gU2oGFayY
9. MapReduce
10. Unix file system and shell scripting basics
11. Regex

Wednesday, January 15, 2020

Common Data Engineer Interview Questions

Azure cloud

1. What is the difference between Azure Data Lake Gen 1 and Gen 2?
2. What are the types of roles in Data Factory and their differences?
3. What is needed in Data Factory to copy data from an on-premises database to Azure storage (ADLS)?
4. What is an Integration Runtime and what are its types?
5. What are the different ways to connect Azure Data Lake to Databricks?

Distributed Systems Open Source

    Spark
      1. Explain your understanding of Spark architecture?
      2. What are Broadcast and Accumulator variables?
      3. Difference between Spark cluster mode and Client mode?
      4. Difference between Checkpoint and Cache
      5. Types of Caching in Spark, Cache vs Persist?
      6. Types of transformations in Spark
      7. What are Spark Jobs, Stages, Tasks and their differences?
      8. What defines the number of stages that can be created in a spark job?
      9. What is Delta Lake?
      10. What is Vacuum in Delta Lake, and what is time travel?
      11. Spark optimization techniques for large table joins and skewed data?
      12. Explain spark partitioning or how is parallelism achieved in spark?
      13. How does Spark ensure fault tolerance?
      14. What are bucketing and partition pruning in Spark?
      15. What is Spark lazy evaluation?



     Hive
      1. What is HCatalog in Hive?

     

      File Formats
       1. Advantages of Parquet files
       2. Types of Parquet files



Tuesday, January 14, 2020

Handling PowerBI CSV parsing for fields with double quotes (")

We have a requirement to create PowerBI Dashboards from Azure Data Lake Store (ADLS Gen2) csv files.

However, when importing data into PowerBI, some columns were misaligned because PowerBI was inappropriately parsing double-quote characters as field delimiters.


Solution

In your ETL, replace all double-quote characters with two single-quote characters.
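If your ETL runs in Spark/Databricks, the replacement can be as simple as the sketch below. The dataframe contents and column name are hypothetical; the point is just to swap the embedded double quotes before writing the CSV that PowerBI imports.

%python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy example standing in for the real extract: a free-text column containing
# embedded double quotes.
sales_df = spark.createDataFrame(
    [(1, 'Monitor 24" widescreen'), (2, 'Standard keyboard')],
    ["product_id", "description"],
)

# Replace embedded double quotes with two single quotes before writing the CSV.
cleaned_df = sales_df.withColumn(
    "description", F.regexp_replace("description", '"', "''")
)
cleaned_df.write.mode("overwrite").option("header", "true").csv(
    "/mnt/adls/curated/sales_reports"   # hypothetical ADLS output path
)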
