Saturday, September 12, 2015

ODBC DRIVER

ODBC. Note that you must already have the Oracle Client installed on your machine for this to work, and the ODBC data source must also be set up on any server where you want to deploy the package.
See this link for installing the Oracle Client:
http://www.oracle.com/technetwork/database/features/instant-client/index.html

Another thing I had to do, because the SSIS package ran for more than 48 hours, was to increase the Oracle PGA limit from the default 2GB to 30GB, since my session had to stay open for a long time.
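For reference, a minimal sketch of that change, assuming Oracle 12c or later (where the PGA_AGGREGATE_LIMIT parameter exists) and a user with ALTER SYSTEM privileges:

-- raise the hard PGA limit from the default to 30GB
ALTER SYSTEM SET pga_aggregate_limit = 30G SCOPE = BOTH;

-- verify the new value
SELECT name, value FROM v$parameter WHERE name = 'pga_aggregate_limit';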

DETAILS


Prerequisites for following this guideline:
1. SSIS 2012 or later installed
2. A connection to a SQL Server database with a table of at least 1 million records (the source table) for testing purposes
3. A connection to an Oracle database, which will contain your destination table
4. Oracle Client 11g or later installed on your machine

Step 1:
Go into the Data Flow Task.



Step 2:

Select your source component. In my case it is SQL Server. You might have other transformations in between before your destination component; in this case the focus is on the destination.

Drag the ODBC Destination into the data flow as shown in the picture.




Step 3

Double-click the ODBC Destination (not shown), then click the 'New...' button to create an ODBC connection.
You might see the page below; if you don't, skip to Step 4. Either way, click the 'New...' button. I see this page because I have already created an ODBC connection on my machine.


Step 4:
You will get another dialog box for the Connection Manager. All you have to do is select the 'Use connection string' option and click 'Build'.





Step 5:

Select the Machine Data Source and click the New button (as always lol).


step 6:
Select the desired type of data source. I prefer a machine data source so that your package can be run by any user on your machine. Click Next.




Step 7:
Scroll down, skip the Microsoft drivers, and choose the 'Oracle in OraClient11g' driver (or whatever name your Oracle Client installation uses). Then click Next, then Finish.


Step 8:
Enter the information for your Oracle database. The Data Source Name is up to you to define.

NOTE: If you plan to deploy your SSIS package to another machine, make sure the data source name matches the name on that machine, so that when you set up the driver there your package will not get a validation error. Choose whatever configuration you need in the dialog below; I left mine with the default settings.
Step 9:
Once you are done, you will eventually get back to the screen below.


Step 10: Click a bunch of 'OK' buttons, then select your table in the ODBC Destination and start loading data.


This is just one way to do this. There are many other ways to set up the Oracle ODBC connector.

See, for example: https://tensix.com/2012/06/setting-up-an-oracle-odbc-driver-and-data-source/

Saturday, July 18, 2015

How to Load a 2 Billion Row (500MB) Table from SQL Server to an Oracle Database with SSIS, Efficiently (Fast, with Manageability and Parallelism)

PROBLEM:
Currently I am facing this issue: I have to design an ETL process that will load a table of 2 BILLION ROWS into a target table in an Oracle DB. Limitations: the ETL tool is SSIS, and there is only a 12-hour network window per day to run the ETL; after that the connection is lost and I have to restart the next day.



SOLUTION:



This is a no-brainer: you have to split up the ETL process into logical divisions that can run in parallel (DIVIDE AND CONQUER).
...but we only have one source table and one target table... how can I split it without running into locking issues when loading the target table?
     Answer: create stage tables or temp tables in the Oracle DB (I prefer physical tables, because with this amount of data a temp table might get corrupted). These stage tables will hold the logically divided source data, which you can then move into the Oracle target table using PARTITION EXCHANGE.

STEP 2: HOW TO SPLIT UP the Source Table into LOGICAL DIVISIONS for PARALLEL Processing

This is the tricky part: if you are not careful, you might slow down your load or get no benefit. But it is still better to SPLIT the ETL process into multiple ones, if only to reduce dependencies and make debugging easier.

Case 1: SOURCE AND TARGET TABLE HAVE SIMILAR PARTITION SCHEMES

Let's say your source table has been partitioned properly, for example with a range partition on a date column (pretty typical). Please review this link if you are not comfortable with partitioning in a SQL Server database: https://msdn.microsoft.com/en-us/library/ms187802.aspx
For example: 2001-2005 (5 years) in the first partition, 2006-2007 in another partition, and 2008-2009 in the final partition. Let's say the current year is 2008.

So we have 3 partitions for this 2-billion-record table, and the target table was designed to match the source table definition (that is, the same partition scheme). A sketch of what such partitioning might look like on the SQL Server side is shown below.
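This is only a minimal sketch with hypothetical names (see the MSDN link above for details). RANGE RIGHT puts each boundary value in the partition to its right, which gives the 2001-2005, 2006-2007, and 2008+ partitions described above:

-- run on the SQL Server source
CREATE PARTITION FUNCTION pf_SourceDate (datetime)
    AS RANGE RIGHT FOR VALUES ('2006-01-01', '2008-01-01');

CREATE PARTITION SCHEME ps_SourceDate
    AS PARTITION pf_SourceDate ALL TO ([PRIMARY]);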

In this case, I would recommend splitting by partition key.
INSERT INTO TARGET TABLE (PARTITION 1) SELECT * FROM SOURCE TABLE (PARTITION 1)

INSERT INTO TARGET TABLE (PARTITION 2) SELECT * FROM SOURCE TABLE (PARTITION 2)

INSERT INTO TARGET TABLE (PARTITION 3) SELECT * FROM SOURCE TABLE (PARTITION 3)
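The statements above are conceptual, since the source is SQL Server and the target is Oracle with SSIS sitting in between. Purely as a sketch (hypothetical object names, using the partition function from the earlier sketch), here is what each side's partition-targeted statement could look like:

-- SQL Server side: read only partition 1 via the $PARTITION function
SELECT *
FROM dbo.Source_Table
WHERE $PARTITION.pf_SourceDate(Date_Column) = 1;

-- Oracle side: direct-path insert from a stage table into one named partition
INSERT /*+ APPEND */ INTO Target_Table PARTITION (p_2001_2005)
SELECT * FROM Stage_Table_1;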



NOTE: The key thing to consider is whether the Date_Column (or partition key) has an index, is part of the primary key, or whether you can specify the partition you want to SELECT FROM. IF YES, go ahead and split the ETL process in parallel this way.

NOTE: While using SSIS, you will not have the option to do "INSERT INTO Target Table (PARTITION 1)"; the Attunity destination does not support that (will need to confirm). The workaround is to create some STAGE tables into which you insert each source table partition, then use partition exchange to move the data into the main table.

For example

INSERT INTO STAGE_TABLE_1 SELECT * FROM SOURCE TABLE (PARTITION 1)

INSERT INTO STAGE_TABLE_2 SELECT * FROM SOURCE TABLE (PARTITION 2)

INSERT INTO STAGE_TABLE_3 SELECT * FROM SOURCE TABLE (PARTITION 3)

BENEFIT: After loading into these stage tables, just use PARTITION EXCHANGE to exchange the stage table data into the appropriate partition of the target table. Read about it here: https://oracle-base.com/articles/misc/partitioning-an-existing-table-using-exchange-partition
It takes about 1 second if you do it 'WITHOUT VALIDATION', while it may take up to 5 minutes for millions of records 'WITH VALIDATION'.
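A sketch of that exchange step, with hypothetical names (the stage table must have the same column structure as the target table; see the oracle-base link above for the full options):

ALTER TABLE Target_Table
  EXCHANGE PARTITION p_2001_2005
  WITH TABLE Stage_Table_1
  WITHOUT VALIDATION;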

WARNING: You may not see much more performance gain if you split the ETL process up further, for example splitting the 2001-2005 period into 3 loads. You would then have to combine those 3 stage tables for the first partition into one before doing the partition exchange. This takes time and needs TESTING to make sure you are not losing more time combining STG tables than you would by just pulling that one partition of the source into the TARGET.

To combine the tables quickly, use Data Pump import with TABLE_EXISTS_ACTION=APPEND, but it will still take some time: http://www.dba-oracle.com/t_table_exists_action_impdp.htm
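If Data Pump is not convenient, a plain-SQL alternative (my own sketch, not from the linked article, with hypothetical names) is a direct-path append from one stage table into the other before the exchange:

-- Stage_Table_1b is a hypothetical second stage table holding another slice of the same partition
INSERT /*+ APPEND */ INTO Stage_Table_1
SELECT * FROM Stage_Table_1b;
COMMIT;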


CASE 2: SOURCE TABLE AND TARGET TABLE HAVE DIFFERENT PARTITION SCHEMES

In this case the source table has 3 partitions while the target table has 10 partitions. If your target table is partitioned by a date column, use a WHERE clause on the source table to channel the source data into the appropriate target table partition.

Even if the source table does not have an index on that date column (or on whatever column you use as the partition key for your target table), it is still better to use that WHERE clause to filter the data into the right partition. (See also the note after the examples below about writing the filter as a date range.)

For example:

Insert into Target Table (partition 1) SELECT * FROM SOURCE_TABLE WHERE YEAR(DATE) <= 2005

Insert into Target Table (partition 2) SELECT * FROM SOURCE_TABLE WHERE YEAR(DATE) =2006

Insert into Target Table (partition 3) SELECT * FROM SOURCE_TABLE WHERE YEAR(DATE) =2007

Insert into Target Table (partition 4) SELECT * FROM SOURCE_TABLE WHERE YEAR(DATE) =2008


ETC.
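One refinement of the filters above (my suggestion, not part of the original statements): writing the predicate as a date range instead of YEAR() keeps it sargable, so an index or partition elimination on the date column can still kick in if one exists. For example, the partition 2 load could be written as:

SELECT *
FROM dbo.Source_Table
WHERE Date_Column >= '2006-01-01'
  AND Date_Column <  '2007-01-01';   -- same rows as YEAR(DATE) = 2006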

SSIS PACKAGE DESIGN

I usually create a separate SSIS package for each partition so that its process is independent of the other partitions. I implement similar error handling and logging functionality for each and every partition. The benefit of this is that you can run and stop each partition's ETL process as much as you want while it runs in parallel with the ETL processes for the other partitions. You can also deploy the packages for different partitions to different servers to optimize performance without them affecting each other.














Sunday, July 5, 2015

How to Load Bit datatype Columns from SQL Server Table to Oracle Table using SSIS

Scenario: I am loading a table that contains a bit datatype column from SQL Server into an Oracle table with a NUMBER datatype column. However, when I load the data using SSIS, the 1's get converted to -1 in the Oracle NUMBER column.


Description: In SQL Server the bit datatype column contains 0's or 1's, but there is no corresponding bit datatype in Oracle. An Oracle table design could use NUMBER(1) or a VARCHAR2(5) [for 'True'/'False']. For performance reasons it is better to use NUMBER(1), though.


Solution

In SSIS, the bit data type from SQL Server is read as a Boolean ('TRUE'/'FALSE') before any other conversion. Since there is no Boolean datatype in Oracle, you have to first convert the bit to a numeric datatype with the SSIS Data Conversion transformation. During that numeric conversion, for some reason, SSIS converts the 1's to -1's while the 0's stay 0's. To resolve this mishap, use the Derived Column transformation: to ensure you are converting the bit correctly, use it to map the bit to 0, 1, or NULL depending on the source value.

In this way your Oracle table will be sure to have those 0's or 1's from source.
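As an alternative to the Derived Column transformation (my own sketch with hypothetical column names, not the approach described above), you can also push the conversion into the source query so SSIS already receives 0, 1, or NULL:

SELECT
    Primary_Key_ID,
    CAST(Is_Active AS tinyint) AS Is_Active_Num   -- bit 1 -> 1, bit 0 -> 0, NULL stays NULL
FROM dbo.Source_Table;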


Details

To come.


Summary
 



Friday, July 3, 2015

How to Capture MilliSeconds in DateTime Columns in SQL Server and load into an Oracle Destination Table

Scenario: I want to transfer the data from a SQL Server source table to an Oracle destination table that has date columns, and I want to ensure I copy all the data (including milliseconds).


Description:
Typically, the datatype used in SQL Server tables for date columns is DATETIME, which stores fractional seconds (e.g. 23:59:59.997). However, the Oracle datatype that DATETIME maps to through SSIS is DATE (e.g. 1998/05/31 12:00:00 AM), which only stores values down to the second and therefore truncates the milliseconds.

Depending on your project objectives or requirements it might be necessary to capture these milliseconds.


Solution:
In the Oracle table, change the date column to use the TIMESTAMP datatype.
In SSIS, use a Data Conversion transformation to convert the SQL DATETIME column to a string; the string then maps to the TIMESTAMP datatype in the Attunity Oracle Destination.
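A minimal sketch of the two sides, with hypothetical names (CONVERT style 121 produces 'yyyy-mm-dd hh:mi:ss.mmm', a 23-character string that keeps the milliseconds):

-- Oracle side: use TIMESTAMP instead of DATE so fractional seconds survive
CREATE TABLE My_Schema.Target_Date_Table (
    Event_ID  NUMBER,
    Event_Ts  TIMESTAMP(3)
);

-- SQL Server side (source query): convert the DATETIME to a string that keeps the milliseconds
SELECT Event_ID,
       CONVERT(varchar(23), Event_Dt, 121) AS Event_Dt_Str
FROM dbo.Source_Table;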

Detailed Solution

To come.


Sunday, June 28, 2015

Best way to load 460 million rows of LOB data (varchar > 4000 / varchar(max)) from a 1.7-billion-record SQL Server table into Oracle table LOB columns, fast, using SSIS.

Scenario: I want to load data from a SQL Server table with 1.7 billion rows into an Oracle table. The source table contains a varchar(max) column that is populated in 460 million of those rows. The ETL tool is SSIS.

SOLUTION

Note: this method only works if your source table has a unique key that can be used to update the target table later.

Overview: The method for tackling this problem is basically DIVIDE and CONQUER. You separate your ETL so that one process loads the records where the varchar(max) column contains 4000 characters or fewer, and a separate process loads the source data where the varchar(max) value is greater than 4000 characters, which hopefully is a small subset of the records.

Then, after loading these into stage tables in your Oracle database, I would use a MERGE statement to update the LOB column values from the different stage tables I loaded. This MERGE statement might take time, but it is a better option than using SSIS to load directly into a CLOB field.

Before following the guidelines outlined here, I would say investigate your source table data. Use the LEN function to check how many rows have more than 4000 characters and how many have fewer. That way you will save time and know whether this method will work for you.

For example: SELECT  count(*) FROM Source_Table where len(varchar_max_col) <=4000
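A one-pass version of that check (just a sketch; rows where the LOB column is NULL fall into neither bucket):

SELECT
    SUM(CASE WHEN LEN(varchar_max_col) <= 4000 THEN 1 ELSE 0 END) AS rows_4000_or_less,
    SUM(CASE WHEN LEN(varchar_max_col) >  4000 THEN 1 ELSE 0 END) AS rows_over_4000
FROM Source_Table;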


STEP 1: PARALLEL PROCESSING and BATCH PROCESSING
Please review link on parallel processing for ETL Design
http://plsqlandssisheadaches.blogspot.com/2015/07/how-to-load-2-billion-rows2-terabyte.html

STEP 2: Method in loading Table with Varchar (max) into Oracle CLOB column fast
1. Create an ETL process to load all the columns of the source into the target table except the LOB column. In other words, omit the varchar(max) column from the source SELECT and leave the corresponding column nullable in your target table while you load the rest of the source data. Use the Microsoft Connector for Oracle by Attunity, in fast load (direct path) mode with the parallel option, for this SSIS ETL step.

2. Create a second ETL process just to load the source rows whose LOB column has 4000 characters or fewer. In this ETL process you SELECT only the source primary key and the varchar(max) column, and load them into a stage Oracle table with a VARCHAR2(4000) column. Again, use the Microsoft Connector for Oracle by Attunity, in fast load (direct path) mode with the parallel option, for this SSIS ETL step.

Illustration
SELECT Primary_key_ID, SUBSTRING(Varchar_max_column, 1, 4000) AS first_4000_char_col
FROM Source_Table
WHERE LEN(Varchar_max_column) <= 4000
-- then insert (via the SSIS data flow) into:
-- Stage_Target_Table (Primary_key_ID, Varchar_4000_column)

The data type of this source column in SSIS should be changed from text (DT_TEXT) to string (DT_STR). The same rule applies to the target Oracle column in the destination component: it should be treated as a string as well. This is how you will be able to load the column fast.

3) The last step of the solution is to create another ETL process to load the source records whose LOB column is > 4000 characters. These records (hopefully fewer than 5 million) are loaded into another stage target table.
Illustration
SELECT Primary_key_ID, Varchar_max_column
FROM Source_Table
WHERE LEN(Varchar_max_column) > 4000
-- then insert (via the SSIS data flow) into:
-- Stage_Target_Table_2 (Primary_key_ID, LOB_column)

In this step use the Microsoft Attunity Oracle Destination in SSIS, but make sure you choose the regular table load option (Table Name), which is compatible with LOB datatypes in Oracle. If you choose the fast direct path / parallel mode, it will not allow you to load the varchar(max) column from the source table.

4) Create a MERGE statement script in Oracle that updates the value of the LOB column in your target table from the stage target tables. First update the LOB column for records with 4000 characters or fewer, then update the records with more than 4000 characters. (A sketch is shown after step 5.)

5) Be sure to create unique indexes on the primary keys of the stage target tables so that the MERGE statement runs fast. Also divide the MERGE statement into batches based on the partition scheme of your target table.
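A sketch of what the merge in step 4 could look like for the <= 4000-character stage table, reusing the example table and column names from this post (the schema on the stage table is assumed; a similar MERGE would then handle Stage_Target_Table_2 for the > 4000-character rows):

MERGE INTO My_Schema.Target_LOB_Table t
USING My_Schema.Stage_Target_Table s
   ON (t.Primary_keyID = s.Primary_key_ID)
WHEN MATCHED THEN
  UPDATE SET t.LOB_column = s.Varchar_4000_column;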

MORE ILLUSTRATIONS
Source SQL Server Table:

Create SQL Server Table: Create table dbo.My_Varchar_max_Table (Primary_keyID int, LOB_column Varchar(max))

In SSIS, drag in your Data Flow Task.

Then an OLE DB Source.


2. Now create your SELECT query. Instead of selecting the varchar(max) column as-is from the source table, use SUBSTRING(valuetext, 1, 4000) to limit it to the first 4000 characters, like below.
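Here is a sketch of such a source query, using the example table created above (the <= 4000 filter and the LOB_column_4000 alias are carried over from the earlier stage-load step):

SELECT Primary_keyID,
       SUBSTRING(LOB_column, 1, 4000) AS LOB_column_4000
FROM dbo.My_Varchar_max_Table
WHERE LEN(LOB_column) <= 4000;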




2b. Sometimes the derived column (Valuetext-4000) will be parsed in the data flow as DT_TEXT, which will slow your performance and will not let you insert it into your destination as a string (DT_STR) of length 4000.

So you have to right-click the OLE DB Source --> Advanced Editor --> Input and Output Properties,
then change the data types of the external and output columns (LOB_column_4000) to DT_STR with length 4000.

3. Do your data conversions or transformations as needed; then move on to the destination table.

Let's first create an Oracle destination table. Here we create an additional column for the first 4000 characters, which we will later use to update the CLOB column.

Create table My_Schema.Target_LOB_Table (Primary_keyID Number, LOB_column_4000 varchar2(4000), LOB_column CLOB)

4. Drag your Attunity for Oracle Destination and configure





Previously, I had suggested using ODBC drivers instead of the Attunity drivers (that is, an ODBC Destination connecting to the Oracle target table, with the connection manager built on the Oracle Client's ODBC driver rather than the Microsoft one). I quickly realised that it does not work properly; see my reasons below.
Unfortunately, this ODBC approach has 2 main problems:
1. If you load data having both LOB columns and other data columns, I noticed that it skips loading some LOB rows. That is, the source table has a value in the LOB field for a particular record, but when you query the target table you see NULL for that record in the LOB field. It messes up about 70% of the time, which is pretty bad.

2. When you deploy your package from one environment or machine to another, you have to manually create the ODBC data source on that machine, and on top of that you have to open up your package and reconfigure the ODBC connection manager in each environment. This is really not good.

Various Methods

Now, if you are reading up to here, thanks; you have enough time on your hands lol.
In my scenario I tried various options, notably:
1. Microsoft Connector for Oracle by Attunity: This was the first method I tried. The Attunity website says it supports LOBs, which is true, but it comes with a catch: you have to change the data access mode to 'Table Name' instead of 'Table Name - Direct Path Fast Load'. This really hurts performance, so it is not acceptable for loading all 460 million LOB rows in Table Name mode.

2. ADO.NET Destination: This again is very slow and gives an error after about a million records. The error was 'too many bind variables', which shows it was not built for bulk loads like this.

3. OLE DB Destination: Again it is slow, and it gives an error after about 10 million records because SSIS cannot keep up with the data flow buffers, so occasionally the data overflows and your package fails.

4. Script Component as a destination: I heard it works, but I am far from a .NET developer, so it is not my area. I would say use it for simple tasks you understand well; but if you don't know C# or VB and your team doesn't either, you will have a problem when the code breaks.
The link below describes it. I have not tested it myself, and I don't know how well it would handle loading a 500GB table.
http://download.microsoft.com/download/0/F/B/0FBFAA46-2BFD-478F-8E56-7BF3C672DF9D/OB_Loading_Data_Oracle.pdf
