Tuesday, January 8, 2019

Performing SCD type 2 in Hive without using a MERGE statement



Introduction



Are you scratching your head trying to do a MERGE statement in Hive? You know this is not the optimal way to use Hive, yet you want to get the work done. Today is your lucky day.


I work for a company that unfortunately picked the wrong tool for routine analytics. Currently we use Hadoop Hive for our day-to-day Business Intelligence analytics requirements. We actually don't have more than 2 million rows of data across all our projects. Talk about underutilization.


One of the most recent requests was to implement a type 2 slowly changing dimension (SCD) for a project that needed a historical trail. A brief review of SCDs can be found on Wikipedia here




Procedure


In this specific project, every day I extract records from an Oracle OLTP (online) database and store them in a staging table. This staging data is then "merged" into a target table. By "merge" I mean inserting new records and updating the "valid to" date of the existing records.


In most relational databases we have a MERGE statement, with its matched-update clause, that can be used effectively to implement SCD type 2. It lets us compare and combine two tables, typically a stage table and a target table, and update the target when a match is found.






In this implementation of SCD type 2, we create four metadata columns to help us keep track of the records (a sketch of the target table layout follows the list below).




  1. UDC_Valid_FROM_DT: This field stores the date the ETL "last saw" the record in the source system; it marks the version of the record as of the day the ETL was run. This date helps tell the story of the various versions of a specific record (per business key). By default, every new extraction from the source gets the system date.
  2. UDC_Valid_TO_DT: This field denotes the end of a record's life span, i.e. its expiration date (per business key). It shows when a specific version of a record ceased to be valid or to exist in the source system. By default, every newly extracted record gets the value '9999-12-31', denoting a future date or infinity, since that is the latest version of the record according to the source system.
  3. UDC_IsCurrent: This field is another way to represent a Valid_TO_DT of '9999-12-31', but its main purpose is to easily show which version of a record (per business key) is the latest according to the source system. By default, every new extract each day gets the value "1"; "0" means the record is no longer the most up-to-date version of that record.
  4. UDC_IsActive: This field denotes records (per business key) that still exist in the source. It does not mark the most current version; rather it shows that the business key can still be found in the source system, since we know the source deletes records. By default, this field is "1" for each extracted record. It is set to "0" whenever the ETL notices that the business key no longer exists in the source system.
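As a concrete sketch (not the exact DDL from my project), the target table could look like this, with col_1, key_col and pk_col standing in as placeholders for the business columns used in the query further down:

-- Hypothetical DDL sketch; col_1, key_col and pk_col are placeholders
-- matching the names used in the query later in this post.
CREATE TABLE target_table (
  pk_col            STRING,   -- primary key column from the source system
  key_col           STRING,   -- tracked column: a change here creates a new version
  col_1             STRING,   -- any other attribute carried along
  udc_valid_from_dt STRING,   -- date the ETL last saw this version ('YYYY-MM-DD')
  udc_valid_to_dt   STRING,   -- expiry date; '9999-12-31' while current
  udc_iscurrent     INT,      -- 1 = latest version of the business key
  udc_isactive      INT       -- 1 = business key still exists in the source
);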


To implement this, I created a staging table that I truncate and load with the source-extracted data on a daily basis.
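As a minimal sketch (landing_daily_extract is a hypothetical name for wherever the raw daily extract lands), the truncate-and-load can be done in Hive with INSERT OVERWRITE:

-- Minimal sketch; landing_daily_extract is a hypothetical landing table
-- for the raw daily extract. INSERT OVERWRITE replaces the previous day's
-- staging data in a single step.
INSERT OVERWRITE TABLE source_stage_table
SELECT col_1, key_col, pk_col
FROM   landing_daily_extract;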


For the target table, we would normally perform a MERGE statement after the very first historical load.
Using the MERGE statement to implement SCD type 2 is routine in any relational database, and it is a powerful SQL statement. In Hive, however, we are limited to more basic SQL statements.
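For reference, here is a minimal sketch of the "expire the old version" step of such a MERGE in a database that does support it. This is illustrative only; the exact syntax varies by vendor, and it uses the UDC_* metadata columns described above.

-- Illustrative only: the "expire the old version" step of an SCD2 MERGE
-- in a database that supports MERGE (not my Hive environment).
MERGE INTO target_table t
USING source_stage_table s
   ON (t.pk_col = s.pk_col)
WHEN MATCHED
 AND t.key_col <> s.key_col
 AND t.udc_valid_to_dt = '9999-12-31'
THEN UPDATE SET t.udc_valid_to_dt = CURRENT_DATE,
                t.udc_iscurrent   = 0;
-- New rows and the new versions of changed rows are then inserted with
-- udc_valid_to_dt = '9999-12-31', udc_iscurrent = 1 and udc_isactive = 1.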


In this blog post, I will show you how I implemented this merge logic in Hive.


Since my version of Hive does not even have UPDATE functionality, I use a "CREATE TABLE AS" statement to recreate a new version of the table (see the sketch after the query below).


NOTE: In this SCD type 2 implementation, we only keep history when there is a change in a specific key column (key_col) that we care about or want to track from the source system.
pk_col means the primary key column.




SELECT
    t.col_1, t.key_col, t.pk_col,
    t.udc_valid_from_dt,
    -- Expire the current version when the pk still exists in the stage (s1)
    -- but its tracked key_col value has changed (no pk + key match in s).
    CASE WHEN t.udc_valid_to_dt = '9999-12-31' AND s.pk_col IS NULL AND s1.pk_col IS NOT NULL
         THEN to_date(from_unixtime(unix_timestamp()))   -- current date on older Hive versions
         ELSE t.udc_valid_to_dt END AS udc_valid_to_dt,
    CASE WHEN t.udc_valid_to_dt = '9999-12-31' AND s.pk_col IS NULL AND s1.pk_col IS NOT NULL
         THEN 0
         ELSE t.udc_iscurrent END AS udc_iscurrent,
    -- Flag the record inactive when the pk no longer exists in the source at all.
    CASE WHEN s1.pk_col IS NULL THEN 0 ELSE 1 END AS udc_isactive
FROM target_table t
LEFT JOIN source_stage_table s  ON t.pk_col = s.pk_col AND t.key_col = s.key_col
LEFT JOIN source_stage_table s1 ON t.pk_col = s1.pk_col

UNION ALL

-- New business keys and new versions of changed keys come in as current records.
SELECT
    s.col_1, s.key_col, s.pk_col,
    to_date(from_unixtime(unix_timestamp())) AS udc_valid_from_dt,
    '9999-12-31' AS udc_valid_to_dt,
    1 AS udc_iscurrent,
    1 AS udc_isactive
FROM source_stage_table s
WHERE NOT EXISTS (
    SELECT 1 FROM target_table t
    WHERE t.pk_col = s.pk_col AND t.key_col = s.key_col
)
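In practice, the SELECT above is wrapped in the CREATE TABLE AS mentioned earlier and the tables are then swapped. A minimal sketch of that wrapper (target_table_new is a hypothetical intermediate name):

-- Minimal sketch of the create-and-swap, since this Hive version has no UPDATE.
-- target_table_new is a hypothetical intermediate table name; its SELECT body
-- is the full query shown above (abbreviated here to a simple pass-through).
CREATE TABLE target_table_new AS
SELECT t.col_1, t.key_col, t.pk_col,
       t.udc_valid_from_dt, t.udc_valid_to_dt,
       t.udc_iscurrent, t.udc_isactive
FROM   target_table t;   -- replace this pass-through with the full merge query

DROP TABLE target_table;
ALTER TABLE target_table_new RENAME TO target_table;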



 







Wednesday, December 5, 2018

IBM DataStage Array Size and Record Count tuning

Hi

I was recently working on extracting a table that has 400 columns from a SQL Server database to load into an Oracle target database. I designed my job and set the array size to 50,000 with a record count of 200,000. To my surprise, the job could not pull even the top 1,000 records after running for over an hour.






Resolution

It turns out that, with my environment's DataStage settings, the buffer is too small to hold an array of 50,000 records (considering that each row has 400 columns). I had to reduce the array size to 100, after which the job ran in 3 minutes for 1 million records.

What is array size: the number of rows that DataStage reads from or writes to the database in one batch. This is also the number of records pushed from one stage to another (pipeline parallelism).

Record count: the number of rows after which DataStage issues a commit in the database, having written them in batches of the array size.


DataStage ETL loads causing the Oracle database to hang and crash (the Oracle database server would become inaccessible)

Lately, we faced an issue in my project: running 3 jobs that loaded tables of 3.5 GB, 1 GB, and 2 GB at the same time choked the Oracle database. The jobs were run on 2 nodes, which works out to 3 x 2 = 6 sessions in parallel. The DBA noticed I/O waits on the transaction log; the volume of inserts was too much for the database, which caused it to lock up.

We also noticed some strange Oracle sessions under the SYS user performing UPDATEs on obj$; the number of these sessions kept growing after their initial appearance.

Another strange thing was that the Oracle dev database handled the same amount of load well when the jobs were run in parallel.



SOLUTION
If you notice that the Oracle database is hanging when running DataStage jobs, try reducing the number of processes run in parallel. A rule of thumb: run any table larger than 3 GB as a standalone job, then run the other two jobs (2 GB and 1 GB) in parallel.

Saturday, August 13, 2016

IBM DATASTAGE SLOW PERFORMANCE DUE TO DOUBLE DataType in Updates and Deletes, ORACLE CONNECTOR

Hi,

Are you facing extremely slow performance in the Oracle Connector when you try to update a table in Oracle using a column with the NUMBER data type in the filtering condition (WHERE clause)?

My friend, look no further: this is a common problem my team and I faced using the latest DataStage version. This DataStage Double data type even locks the table, in addition to being extremely slow.

Why is this happening? In Oracle, the NUMBER data type without any precision or scale is an all-purpose numeric type that can store all sorts of numbers, from negative values to numbers with many decimal places.

This causes DataStage to interpret it as a Double data type, which is essentially a floating-point numeric type.

During inserts you will not notice any performance hit.
However, during updates or deletes where the filtering condition uses a Double-typed column, you will notice how slow and practically nonfunctional it becomes.
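As a hypothetical illustration (table and column names are made up), this is the kind of statement where we saw the problem: an update whose WHERE clause is keyed on an unscaled NUMBER column that DataStage has imported as Double.

-- Hypothetical illustration only; table and column names are made up.
-- ORDER_ID is defined as NUMBER (no precision or scale) in Oracle, which
-- DataStage imports as a Double. Updates keyed on that column, like the
-- parameterized equivalent of this one, were extremely slow and locked the table.
UPDATE order_target
SET    order_status = 'SHIPPED'
WHERE  order_id = 1001;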

To fix the situation, we decided to remap the column definition on the DataStage side from the Double data type to Decimal(38,10). This way we are sure not to run into the issue: converting it from a floating-point type to the more precise Decimal boosts performance and prevents locks on the table.

This can easily be done during the table definition import by using the Orchestrate plug-in for Oracle instead of the wizard.


Converting SQL Real Datatype to NUMBER in SSIS

Problem: I have a column in SQL Server with the REAL data type, and I want to transfer it to Oracle as a NUMBER data type. In SSIS, if I cast the REAL data type to float, it generates unwanted floating-point digits.

Problem Image:


SOLUTION

This is actually not an issue; you can copy the float as it is. However, because certain applications do computations with the values in the column, you might want to cast the REAL column to DECIMAL(38,10).
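As a hedged sketch (table and column names are made up), one way to do the cast is directly in the SQL Server source query, so SSIS already sees a decimal:

-- Hypothetical illustration; table and column names are made up.
-- Casting in the source query removes the unwanted floating-point digits
-- before the data reaches the Oracle destination.
SELECT CAST(measure_value AS DECIMAL(38,10)) AS measure_value
FROM   dbo.source_measures;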

