Saturday, July 18, 2015

How to LOAD a 2-Billion-Row (500 GB) Table from SQL SERVER to an ORACLE Database with SSIS, Efficiently (FAST, Manageable, and Parallel)

PROBLEM:
Currently, I am facing this issue: I have to design an ETL process that will load a table of 2 BILLION ROWS into a target table in an Oracle DB. Limitations: the ETL tool is SSIS, and there is a 12-hour network window per day to run the ETL; after that the connection is lost and I have to restart the next day.



SOLUTION:



This is a no-brainer: you have to split up the ETL process into logical divisions that can run in parallel (DIVIDE AND CONQUER).
...but we only have one source table and one target table... how can I split it without having locking issues when loading the target table?
     Answer: create stage tables in the Oracle DB (I prefer physical tables because, with this amount of data, a temp table might get corrupted). These stage tables will store the logically divided source data, which you can then move into the Oracle target table using PARTITION EXCHANGE.
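
For example, a minimal sketch of creating one stage table as an empty structural copy of the target (the table names here are hypothetical):

-- Empty copy of the target's structure; WHERE 1 = 0 copies no rows
CREATE TABLE STAGE_TABLE_1 AS
SELECT * FROM TARGET_TABLE WHERE 1 = 0;

Any local indexes or constraints the partition exchange needs would have to be added to the stage table separately.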

STEP 2: HOW TO SPLIT UP THE SOURCE TABLE INTO LOGICAL DIVISIONS FOR PARALLEL PROCESSING

This is the tricky part: if you are not careful, you might slow down your load or get no benefit at all. Even so, it is still better to SPLIT the ETL process into multiple ones, just to reduce dependencies and improve debugging.

Case 1: SOURCE AND TARGET TABLE HAVE SIMILAR PARTITION SCHEMES

Let's say your source table has been partitioned properly, for example with a range partition on a date column (pretty typical). Please review this link if you are not comfortable with partitioning in SQL Server: https://msdn.microsoft.com/en-us/library/ms187802.aspx
For example: 2001-2005 (5 years) in the first partition, 2006-2007 in another partition, and 2008-2009 in the final partition. Let's say our current year is 2008.

So we have 3 partitions for this 2-billion-row table, and the target table was designed to match the source table definition (that is, the same partition scheme).
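
For reference, a range partitioning like the one above could be declared on the SQL Server side roughly as follows (the function and scheme names and the boundary dates are hypothetical):

-- RANGE RIGHT: rows before 2006-01-01 go to partition 1 (2001-2005),
-- 2006-2007 to partition 2, and 2008 onward to partition 3
CREATE PARTITION FUNCTION pf_SourceDate (datetime)
AS RANGE RIGHT FOR VALUES ('2006-01-01', '2008-01-01');

CREATE PARTITION SCHEME ps_SourceDate
AS PARTITION pf_SourceDate ALL TO ([PRIMARY]);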

In this case, I would recommend splitting by partition key.
INSERT INTO TARGET_TABLE (PARTITION 1) SELECT * FROM SOURCE_TABLE (PARTITION 1)

INSERT INTO TARGET_TABLE (PARTITION 2) SELECT * FROM SOURCE_TABLE (PARTITION 2)

INSERT INTO TARGET_TABLE (PARTITION 3) SELECT * FROM SOURCE_TABLE (PARTITION 3)
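
On the SQL Server side, you can pull a single partition with the $PARTITION function; a minimal sketch, reusing the hypothetical pf_SourceDate function from above:

-- Read only the rows that fall in partition 1 of the source table
SELECT *
FROM dbo.SourceTable
WHERE $PARTITION.pf_SourceDate(Date_Column) = 1;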



NOTE: The key thing to consider is whether the Date_Column (or partition key) has an index, is part of the primary key, or lets you specify the partition you want to SELECT FROM. IF YES, go ahead and split the ETL process in parallel this way.

NOTE: While using SSIS, you will not have the option to do "INSERT INTO Target Table (PARTITION 1)"; the Attunity Oracle Destination does not support that (will need to confirm). The workaround is to create some STAGE tables where you can insert each source table partition, then use partition exchange to move the staged data into the main table.

For example

INSERT INTO STAGE_TABLE_1 SELECT * FROM SOURCE_TABLE (PARTITION 1)

INSERT INTO STAGE_TABLE_2 SELECT * FROM SOURCE_TABLE (PARTITION 2)

INSERT INTO STAGE_TABLE_3 SELECT * FROM SOURCE_TABLE (PARTITION 3)

BENEFIT: After loading into these stage tables, just use PARTITION EXCHANGE to swap each stage table's data into the appropriate partition of the target table. Read about it here: https://oracle-base.com/articles/misc/partitioning-an-existing-table-using-exchange-partition
It takes about 1 second if you do it WITHOUT VALIDATION, while it may take up to 5 minutes for millions of records WITH VALIDATION.
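
The exchange itself is a one-line DDL statement; a minimal sketch, assuming the target partition is named P1 (all names hypothetical):

-- Swap the staged rows into partition P1 as a metadata-only operation
ALTER TABLE TARGET_TABLE
  EXCHANGE PARTITION p1 WITH TABLE STAGE_TABLE_1
  WITHOUT VALIDATION;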

WARNING: You may not see much more performance gain if you split the ETL process up further. For example, the 2001-2005 period could itself be split into 3 loads, but you would then have to combine those 3 stage tables into one before doing the exchange. That takes time of its own and needs TESTING, to make sure you are not losing more time combining the stage tables than you would just pulling that one partition of the source into the TARGET.

To combine the tables, use Data Pump import with TABLE_EXISTS_ACTION=APPEND. It is fast, but it will still take some time: http://www.dba-oracle.com/t_table_exists_action_impdp.htm
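
A rough sketch of that combine step, assuming a Data Pump directory object named DP_DIR and hypothetical stage table names (export one stage table, then import it into the other with an append):

expdp etl_user/password directory=DP_DIR dumpfile=stg1b.dmp tables=STAGE_TABLE_1B
impdp etl_user/password directory=DP_DIR dumpfile=stg1b.dmp remap_table=STAGE_TABLE_1B:STAGE_TABLE_1A table_exists_action=append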


CASE 2: SOURCE TABLE AND TARGET TABLE HAVE DIFFERENT PARTITION SCHEMES

The source table in this case has 3 partitions while the target table has 10. If the target table is partitioned by a date column, we use a WHERE clause on the source table to channel the source data into the appropriate target table partition.

Even if the source table does not have an index on that date column (or on whatever column is the partition key of your target table), it is still better to use that WHERE clause to filter the data into the right partition.

For example (make sure the predicates cover every year exactly once; a sargable variant is sketched after this list):

INSERT INTO TARGET_TABLE (PARTITION 1) SELECT * FROM SOURCE_TABLE WHERE YEAR(DATE) <= 2005

INSERT INTO TARGET_TABLE (PARTITION 2) SELECT * FROM SOURCE_TABLE WHERE YEAR(DATE) = 2006

INSERT INTO TARGET_TABLE (PARTITION 3) SELECT * FROM SOURCE_TABLE WHERE YEAR(DATE) = 2007

INSERT INTO TARGET_TABLE (PARTITION 4) SELECT * FROM SOURCE_TABLE WHERE YEAR(DATE) = 2008

ETC.
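
One caution on the predicates above: YEAR(DATE) prevents SQL Server from using an index on the date column. If that column is (or later becomes) indexed, the same filter can be written as a sargable range; a sketch with hypothetical names:

-- Equivalent filter for the 2006 slice that can still use an index on [Date]
SELECT *
FROM dbo.SourceTable
WHERE [Date] >= '2006-01-01' AND [Date] < '2007-01-01';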

SSIS PACKAGE DESIGN

I usually create a separate SSIS package for each partition so that the process is independent of the other partitions. I implement the same error-handling and logging logic for each and every partition. The benefit is that you can start and stop each partition's ETL process as much as you want while it runs in parallel with the ETL processes for the other partitions. You can even deploy different partitions' packages to different servers to optimize performance without them affecting each other.
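
For instance, a minimal sketch of launching the per-partition packages in parallel from a Windows command prompt (the package file names are hypothetical):

REM Each START returns immediately, so the three loads run concurrently
start "" dtexec /F "LoadPartition1.dtsx"
start "" dtexec /F "LoadPartition2.dtsx"
start "" dtexec /F "LoadPartition3.dtsx"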



Sunday, July 5, 2015

How to Load Bit datatype Columns from SQL Server Table to Oracle Table using SSIS

Scenario: I am loading a table that contains a bit datatype column from SQL Server into an Oracle table with a NUMBER datatype column. However, when I load the data using SSIS, the 1's get converted to -1 in the Oracle NUMBER column.


Description: In SQL Server, a bit datatype column contains 0's or 1's, but there is no corresponding bit datatype in Oracle. Some Oracle table designs use NUMBER(1), others use VARCHAR2(5) (for 'True'/'False'). For performance reasons, it is better to use NUMBER(1), though.
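
For example, a minimal sketch of the NUMBER(1) design on the Oracle side (the table and column names are hypothetical):

-- Flag column holding 0 or 1, mapped from the SQL Server bit column
CREATE TABLE TARGET_FLAGS (
  IS_ACTIVE NUMBER(1) CHECK (IS_ACTIVE IN (0, 1))
);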


Solution

In SSIS, the bit datatype from SQL Server is read as a boolean (TRUE/FALSE) before any other conversion. Since there is no boolean datatype in Oracle, you have to convert the bit to a numeric datatype, but for some reason the Data Conversion transformation turns the 1's into -1's (while the 0's stay 0's). To resolve this mishap, use the Derived Column transformation instead: write an expression that explicitly maps the bit to 0, 1, or NULL depending on the source value.
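
A sketch of such a Derived Column expression, assuming a source column named MyBitColumn; it maps NULL to NULL, TRUE to 1, and FALSE to 0:

ISNULL(MyBitColumn) ? NULL(DT_I4) : (MyBitColumn ? 1 : 0)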

This way, you can be sure your Oracle table ends up with the same 0's and 1's as the source.


Details

To come.


Summary
 



Friday, July 3, 2015

How to Capture Milliseconds in DateTime Columns in SQL Server and Load into an Oracle Destination Table

Scenario: I want to transfer the data from a SQL Server source table to an Oracle destination table that has date columns, and I want to ensure I copy all of the data (including milliseconds).


Description:
Typically, the datatype used in SQL Server tables for date columns is DATETIME, which carries milliseconds (for example, 23:59:59.997). However, the corresponding datatype that DATETIME maps to in Oracle through SSIS is DATE (for example, 1998/05/31 12:00:00 AM), which truncates the milliseconds.

Depending on your project objectives or requirements it might be necessary to capture these milliseconds.


Solution:
In the Oracle table, make the date column use the TIMESTAMP datatype.
In SSIS, use a Data Conversion transformation to convert the SQL Server DATETIME column to a string; the string then maps to the TIMESTAMP datatype in the Attunity Oracle Destination (see the sketch below).
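
A minimal sketch of the two pieces, with hypothetical names; TIMESTAMP(3) keeps three fractional-second digits, which covers DATETIME's millisecond precision:

-- Oracle side: declare the column as TIMESTAMP(3) instead of DATE
CREATE TABLE TARGET_TABLE (
  EVENT_TS TIMESTAMP(3)
);

In the Data Conversion transformation, convert the DATETIME column to a wide string such as DT_WSTR(27) so the fractional seconds survive the transfer.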

Detailed Solution

To come.

