Sunday, June 28, 2015

Best way to load 460 million rows of LOB data (varchar(max), including values over 4000 characters) from a 1.7 billion record SQL Server table into Oracle LOB columns, fast, using SSIS.

Scenario: I want to load data from a SQL Server table with 1.7 billion rows into an Oracle table. The source table has a varchar(max) column that is populated in about 460 million of those rows. The ETL tool is SSIS.

SOLUTION

--Note: this method only works if your source table has a unique key that can be used to update the target table later.

Overview: the approach here is basically DIVIDE and CONQUER. You separate your ETL so that one process loads the records where the varchar(max) column contains 4000 characters or fewer, and a separate process loads the records where it exceeds 4000 characters, which should hopefully be a small subset of records.

Then, after loading these into stage tables in your Oracle database, use a MERGE statement to update the LOB column in the target table from the stage tables you loaded. This MERGE statement might take time, but it is a better option than using SSIS to load directly into a CLOB field.

Before following the guidelines outlined here, investigate your source table data. Use the LEN function to check how many rows have more than 4000 characters and how many have 4000 or fewer. That way you save time and know whether this method will work for you.

For example: SELECT  count(*) FROM Source_Table where len(varchar_max_col) <=4000
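If you want both counts in one pass, here is a quick sketch (table and column names are the same illustrative ones as above):

SELECT
  SUM(CASE WHEN LEN(varchar_max_col) <= 4000 THEN 1 ELSE 0 END) AS rows_4000_or_less,
  SUM(CASE WHEN LEN(varchar_max_col) >  4000 THEN 1 ELSE 0 END) AS rows_over_4000,
  SUM(CASE WHEN varchar_max_col IS NULL THEN 1 ELSE 0 END)      AS rows_null_lob
FROM Source_Table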


STEP 1: PARALLEL PROCESSING and BATCH PROCESSING
Please review link on parallel processing for ETL Design
http://plsqlandssisheadaches.blogspot.com/2015/07/how-to-load-2-billion-rows2-terabyte.html

STEP 2: Method in loading Table with Varchar (max) into Oracle CLOB column fast
1. Create an ETL process to load all the columns of the source into the target table except the LOB column. In other words, omit the varchar(max) column from your SELECT against the source table and let the corresponding column be nullable in the target table while you load the rest of the source data. Use the Microsoft Connector for Oracle by Attunity, in Fast Load (Direct Path) mode, in parallel, for this SSIS ETL step.
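An illustrative source query for this first load (the other column names are made up; the point is that the varchar(max) column is simply not selected):

SELECT Primary_key_ID, Col_A, Col_B, Col_C   -- every column EXCEPT the varchar(max) column
FROM Source_Table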

2. Create a second ETL process just to load the source rows where the LOB column has 4000 characters or fewer. In this ETL process you SELECT only the source primary key and the varchar(max) column, then load them into a stage Oracle table whose column has the VARCHAR2(4000) data type. Use the Microsoft Connector for Oracle by Attunity, in Fast Load (Direct Path) mode, in parallel, for this SSIS ETL step.

Illustration
Select Primary_key_ID, substring(Varchar_max_column, 1, 4000) as first_4000_char_col
from Source_Table
where LEN(Varchar_max_column) <= 4000
-- loaded through the SSIS data flow into
Stage_Target_Table (Primary_key_ID, Varchar_4000_column)

In SSIS, the data type of this column coming from the source should be changed from DT_TEXT to a string type (DT_STR, length 4000). The same rule applies to the column mapped to your target Oracle table: it should also be treated as a string. This is how you will be able to load the column fast.
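A sketch of the Oracle stage table for this step (schema and names are illustrative and match the illustration above):

CREATE TABLE My_Schema.Stage_Target_Table
( Primary_key_ID       NUMBER,
  Varchar_4000_column  VARCHAR2(4000)   -- matches the 4000 character cutoff used in the SSIS query
)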

3) The last step of the load is to create another ETL process for the source records whose LOB column is longer than 4000 characters. These records, hopefully fewer than 5 million, will be loaded into another stage target table.
Illustration
Select Primary_key_ID, Varchar_max_column
from Source_Table
where LEN(Varchar_max_column) > 4000
-- loaded through the SSIS data flow into
Stage_Target_Table_2 (Primary_key_ID, LOB_column)

In this step, use the Microsoft/Attunity Oracle Destination in SSIS, but make sure you choose the regular Table Name load option. This option is compatible with LOB data types in Oracle. If you choose Fast Load (Direct Path) in parallel mode, it will not let you load the varchar(max) column from the source table.
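A sketch of the second Oracle stage table (names are illustrative); here the column has to be a CLOB, since these values are longer than 4000 characters:

CREATE TABLE My_Schema.Stage_Target_Table_2
( Primary_key_ID  NUMBER,
  LOB_column      CLOB
)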

4) Create a MERGE script in Oracle that updates the varchar(max)/LOB column in your target table from the stage tables. First update the LOB column for records with 4000 characters or fewer, then update the records with more than 4000 characters.

5) Make sure you create unique indexes on the primary keys of the stage tables so that the MERGE statement runs fast. Also divide the MERGE statement into batches based on the partition scheme of your target table.
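A sketch of what the index and the MERGE for the over-4000-character stage table could look like (table and column names follow the illustrations in this post and are assumptions; the MERGE for the 4000-or-less stage table has the same shape):

CREATE UNIQUE INDEX My_Schema.UX_Stage_Target_Table_2
  ON My_Schema.Stage_Target_Table_2 (Primary_key_ID);

MERGE INTO My_Schema.Target_LOB_Table t
USING (SELECT Primary_key_ID, LOB_column
       FROM My_Schema.Stage_Target_Table_2
       -- to batch by the target table's partition scheme, add a filter on the
       -- partition key here and run one MERGE per partition
      ) s
ON (t.Primary_keyID = s.Primary_key_ID)
WHEN MATCHED THEN
  UPDATE SET t.LOB_column = s.LOB_column;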

MORE ILLUSTRATIONS
Source SQL Server Table:

Create SQL Server Table: Create table dbo.My_Varchar_max_Table (Primary_keyID int, LOB_column Varchar(max))

1. In SSIS, drag in a Data Flow Task, then add an OLE DB Source.


2. Now create your SELECT query. Instead of selecting the varchar(max) column as it is in the source table, use substring(valuetext, 1, 4000) to limit it to the first 4000 characters, like below.
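A sketch of that query against the example table created above (LOB_column comes from the CREATE TABLE statement; the alias is illustrative):

SELECT Primary_keyID,
       SUBSTRING(LOB_column, 1, 4000) AS LOB_column_4000
FROM dbo.My_Varchar_max_Table
WHERE LEN(LOB_column) <= 4000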




2b. Sometimes the derived column (the 4000-character version of valuetext) will be parsed in the data flow as DT_TEXT, which will slow your performance and will not allow you to insert it into your destination as DT_STR with length 4000.

So you have to RIGHT CLICK on the OLE DB Source --> ADVANCED EDITOR --> Input and Output Properties,
then change the data type of the input and output derived column (LOB_column_4000) to DT_STR with length 4000.

3. Do your data conversions or transformations as needed; the next thing to set up is the destination table.

Let's first create an Oracle destination table. Here we create an additional column for the first 4000 characters, which we will later use to update the CLOB column.

Create table My_Schema.Target_LOB_Table (Primary_keyID Number, LOB_column_4000 varchar2(4000), LOB_column CLOB)

4. Drag in your Attunity for Oracle Destination and configure it.





Previously, I had suggested using ODBC drivers instead of the Attunity drivers, i.e. an ODBC Destination with an ODBC connection manager to the Oracle client. I quickly realised that it does not work properly; see my reasons below.
Unfortunately, this ODBC approach has 2 main problems:
1. If you load data having both LOB columns and other data field columns, I noticed that it skips loading some LOB rows. That is, the source table has a value in the LOB field for a particular record, but when you query the target table you see NULL for that record in the LOB field. It makes this mess about 70% of the time, which is pretty bad.

2. When you try to deploy your package from one environment or machine to another, you have to manually create the ODBC data source on that machine, and on top of that you have to open up your package and reconfigure the ODBC connection manager in each environment. This is really not good.

Various Methods

Now if you are reading up to here, thanks; you have enough time on your hands lol.
In my scenario I tried various options. Notably
1. Microsoft Connector for Oracle by Attunity: This was the first method that I tried. The Attunity website says it supports LOBs, which is true, but it comes with a catch. You have to change the Data Access Mode to Table Name instead of Table Name - Fast Load (Direct Path). This really hurts performance, so it is not acceptable for loading all 460 million LOB rows with the Table Name mode.

2. ADO.NET Destination: This again is very slow and gives an error after about a million records. The error was "too many bind variables", which shows it was not built for loading data at this volume.

3. OLE DB Destination: Again, it is slow and gives an error after about 10 million records because SSIS cannot keep up with the data flow buffers, so occasionally the data overflows and your package fails.

4. Script Component as Destination: I have heard it works, but I am far from a .NET developer, so it is not my area. For a simple task that you understand, use it. But if you don't know C# or VB and your team does not know it either, you will have a problem if the code breaks.
The link below describes it. I have not tested it myself, and I don't know how well it would handle a 500 GB table.
http://download.microsoft.com/download/0/F/B/0FBFAA46-2BFD-478F-8E56-7BF3C672DF9D/OB_Loading_Data_Oracle.pdf
