Saturday, July 18, 2015

How to LOAD a 2 Billion Row (500MB) Table from SQL SERVER to an ORACLE Database with SSIS, Efficiently (Fast, Manageable, and Parallel)

PROBLEM:
Currently, I am facing this issue: I have to design an ETL process that will load a table of 2 BILLION ROWS into a target table in an Oracle DB. Limitations: the ETL tool is SSIS, and there is a 12-hour network time block to run the ETL per day; after that the connection is lost and I have to restart the next day.



SOLUTION:



This is a no-brainer: you have to split the ETL process into logical divisions that can run in parallel (DIVIDE AND CONQUER).
...but we only have one Source Table and one Target Table. How can I split it without having locking issues when loading the Target table?
     Answer: create stage tables in the Oracle DB (I prefer physical tables because, with this amount of data, a temp table might get corrupted). These stage tables will store the logically divided source data, which you can then use PARTITION EXCHANGE to load into the Oracle Target Table.
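
For example, here is a minimal sketch of creating such a stage table in Oracle (the table names are made up; the key requirement is that a stage table used for partition exchange must be non-partitioned and match the target table's column structure exactly):

-- Create an empty, non-partitioned stage table with the same columns as the target
CREATE TABLE STAGE_TABLE_1 AS
SELECT * FROM TARGET_TABLE WHERE 1 = 0;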

STEP 2: HOW TO SPLIT UP THE SOURCE TABLE INTO LOGICAL DIVISIONS FOR PARALLEL PROCESSING

This is the tricky part: if you are not careful, you might slow down your load or get no benefit at all. Even so, it is still better to SPLIT the ETL process into multiple ones, just to reduce dependencies and improve debugging.

Case 1: SOURCE AND TARGET TABLE HAVE SIMILAR PARTITION SCHEMES

Let's say your source table has been partitioned properly, for example with a range partition on a date column (pretty typical). Please review this link if you are not comfortable with partitioning in SQL Server: https://msdn.microsoft.com/en-us/library/ms187802.aspx
For example: 2001-2005 (5 years) in the first partition, 2006-2007 in another partition, and 2008-2009 in the final partition. Let's say our current year is 2008.

So we have 3 partitions for this 2 billion record table. Assume the Target Table was designed to match the Source Table definition (that is, the same partition scheme).
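
For reference, a hedged T-SQL sketch of what such a source partition setup might look like (pf_load_years and ps_load_years are hypothetical names; RANGE RIGHT puts each boundary value into the partition to its right, giving partition 1 = up to 2005, partition 2 = 2006-2007, partition 3 = 2008 onward):

-- Partition function: two boundary values create three partitions
CREATE PARTITION FUNCTION pf_load_years (datetime)
AS RANGE RIGHT FOR VALUES ('2006-01-01', '2008-01-01');

-- Partition scheme mapping all partitions to the PRIMARY filegroup
CREATE PARTITION SCHEME ps_load_years
AS PARTITION pf_load_years ALL TO ([PRIMARY]);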

In this case, I would recommend splitting by partition key.
INSERT INTO TARGET TABLE (PARTITION 1) SELECT * FROM SOURCE TABLE (PARTITION 1)

INSERT INTO TARGET TABLE (PARTITION 2) SELECT * FROM SOURCE TABLE (PARTITION 2)

INSERT INTO TARGET TABLE (PARTITION 3) SELECT * FROM SOURCE TABLE (PARTITION 3)
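
On the source (SQL Server) side, the pseudocode above can be expressed with the built-in $PARTITION function, which filters a query down to a single partition (this sketch assumes the hypothetical pf_load_years function shown earlier):

-- Read only the rows that live in partition 1 of the source table
SELECT *
FROM SOURCE_TABLE
WHERE $PARTITION.pf_load_years(DATE_COLUMN) = 1;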



NOTE: The key things to consider: does the Date_Column (or partition key) have an index, is it part of the primary key, or can you specify the partition you want to SELECT FROM? IF YES, go ahead and split the ETL process in parallel this way.

NOTE: While using SSIS, you will not have the option to do "INSERT INTO Target Table (PARTITION 1)"; the Attunity Destination does not support that (will need to confirm). The workaround is to create some STAGE tables into which you insert each Source Table partition, then use partition exchange to move the data into the main table.

For example

INSERT INTO STAGE_TABLE_1 SELECT * FROM SOURCE TABLE (PARTITION 1)

INSERT INTO STAGE_TABLE_2 SELECT * FROM SOURCE TABLE (PARTITION 2)

INSERT INTO STAGE_TABLE_3 SELECT * FROM SOURCE TABLE (PARTITION 3)

BENEFIT: After loading into these stage tables, just use PARTITION EXCHANGE to swap the stage table data into the appropriate partition of the target table. Read about it here: https://oracle-base.com/articles/misc/partitioning-an-existing-table-using-exchange-partition
It takes about 1 second if you do it WITHOUT VALIDATION, while it may take up to 5 minutes for millions of records WITH VALIDATION.
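
A minimal sketch of the exchange itself, assuming a target partition named P1 and the stage table from earlier (all names here are hypothetical):

-- Swap the stage table's data segment into partition P1 of the target;
-- this is a metadata-only operation, which is why it is so fast
ALTER TABLE TARGET_TABLE
  EXCHANGE PARTITION P1 WITH TABLE STAGE_TABLE_1
  INCLUDING INDEXES
  WITHOUT VALIDATION;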

WARNING: You may not see much more performance gain if you split the ETL process up further; for example, the 2001-2005 period could itself be split into 3 loads. However, you would then have to combine those 3 stage tables into one before doing the exchange partition. That combining step takes time and needs TESTING to ensure you are not losing more time merging STG tables than you would just pulling that one partition of the source into the TARGET.

To combine the tables quickly, use a Data Pump export/import with TABLE_EXISTS_ACTION=APPEND, but it will still take some time: http://www.dba-oracle.com/t_table_exists_action_impdp.htm
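
A rough sketch of what that import command could look like (the user, directory, dump file, and table names are all placeholders; REMAP_TABLE redirects the imported rows into the combined stage table):

impdp etl_user/password DIRECTORY=dp_dir DUMPFILE=stage_1b.dmp TABLES=STAGE_TABLE_1B REMAP_TABLE=STAGE_TABLE_1B:STAGE_TABLE_1 TABLE_EXISTS_ACTION=APPEND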


CASE 2: SOURCE TABLE AND TARGET TABLE HAVE DIFFERENT PARTITION SCHEMES

The Source Table in this case has 3 partitions while the Target table has 10. If your target table is partitioned by a date column, we use a WHERE clause on the source table to channel the source data into the appropriate Target table partition.

Even if the source table does not have an index on that date column (or on whatever column you use as the partition key for your target table), it is still better to use a WHERE clause to filter the data into the right partition.

for example

Insert into Target Table (partition 1) SELECT * FROM SOURCE_TABLE WHERE YEAR(DATE) <= 2005

Insert into Target Table (partition 2) SELECT * FROM SOURCE_TABLE WHERE YEAR(DATE) =2006

Insert into Target Table (partition 3) SELECT * FROM SOURCE_TABLE WHERE YEAR(DATE) =2007

Insert into Target Table (partition 4) SELECT * FROM SOURCE_TABLE WHERE YEAR(DATE) =2008


ETC.
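
One caveat worth testing: wrapping the column in YEAR() prevents SQL Server from using an index on the date column even when one exists. If there is an index, a plain range predicate on the column (a sargable filter) is usually the safer form, for example:

Insert into Target Table (partition 2)
SELECT * FROM SOURCE_TABLE
WHERE DATE >= '2006-01-01' AND DATE < '2007-01-01'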

SSIS PACKAGE DESIGN

I usually create a separate SSIS package for each partition so that each process is independent of the others. I implement similar error handling and logging functionality for each and every partition. The benefit of this is that you can start and stop each partition's ETL process as much as you want while running it in parallel with the ETL processes for the other partitions. You can also deploy different partitions' packages to different servers to optimize performance without them affecting each other.













