OCI Data Integration — Incremental data load
As part of a Data Warehouse project, at some point the discussion turns to how to load incremental data into the Data Warehouse. There are multiple ways to achieve this; for example, David Allan showed in this post how to use metadata to bring extracts from Fusion Apps.
This is just one more example; you can adapt it to your own use case.
The use case is: read data from any source, load it into staging tables (overwrite mode), and then load the data into the Data Warehouse table (insert mode). I’m not going to show how to create a Data Flow, Tasks, etc. You can find plenty of examples on Oracle LiveLabs and Oracle Blogs for OCI Data Integration.
What do we need to configure?
- Control table
- Database Procedures
- Filters and parameters on Data Flows
The control table, as the name suggests, is used to control the ETL processes and keep a history of data extracts.
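As a rough sketch, the control table could look like this. The table and column names here are illustrative assumptions, not the exact ones from the project:

```sql
-- Hypothetical control table: one row per extract execution.
CREATE TABLE ETL_CONTROL (
  SOURCE_NAME  VARCHAR2(100) NOT NULL,  -- source system identifier
  TABLE_NAME   VARCHAR2(128) NOT NULL,  -- target table being loaded
  LOAD_DATE    DATE          NOT NULL,  -- high-water mark reached by this run
  ROWS_LOADED  NUMBER,                  -- rows written by the run
  STATUS       VARCHAR2(20),            -- e.g. SUCCESS / ERROR
  RUN_START    TIMESTAMP,
  RUN_END      TIMESTAMP
);
```

Keeping one row per run (instead of updating a single row in place) gives you the extract history for free.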
We will have 2 procedures:
- LOAD_ETL_PARAMETERS — takes 2 input parameters (SOURCE and TABLE) and one output parameter (MAX_LOAD_DATE). It searches the control table for the last loaded date.
- PRC_MANAGE_ETL — takes 8 input parameters and records the result of each execution in the control table (covered in the pipeline section below).
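A minimal sketch of LOAD_ETL_PARAMETERS could look like this, assuming a control table named ETL_CONTROL with SOURCE_NAME, TABLE_NAME, LOAD_DATE and STATUS columns (all names are illustrative):

```sql
-- Sketch only: returns the most recent successfully loaded date
-- for a given source/table, or a floor date on the first run.
CREATE OR REPLACE PROCEDURE LOAD_ETL_PARAMETERS (
  P_SOURCE        IN  VARCHAR2,
  P_TABLE         IN  VARCHAR2,
  P_MAX_LOAD_DATE OUT DATE
) AS
BEGIN
  SELECT NVL(MAX(LOAD_DATE), DATE '1900-01-01')
    INTO P_MAX_LOAD_DATE
    FROM ETL_CONTROL
   WHERE SOURCE_NAME = P_SOURCE
     AND TABLE_NAME  = P_TABLE
     AND STATUS      = 'SUCCESS';
END;
```

The NVL fallback matters: on the very first run there is no history yet, so the procedure returns a date old enough to pull everything.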
Create a SQL Task on OCI Data Integration for each procedure. You can define some default values if you want:
Filters and Parameters
We need to use filters with user-defined parameters:
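On the data flow's filter operator, the condition compares the source's change-tracking column against the parameter. The operator, table and column names below are illustrative; what matters is that MAX_LOAD_DATE is promoted to a user-defined parameter so the pipeline can pass in the value returned by LOAD_ETL_PARAMETERS at run time:

```sql
-- Hypothetical filter condition on the data flow's FILTER operator.
FILTER_1.SRC_TABLE.LAST_UPDATE_DATE > $MAX_LOAD_DATE
```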
Put all pieces together
Once you have created the data flows, integration tasks, and SQL tasks, you can call everything from a single pipeline.
You can run multiple processes in parallel. The first step is the SQL Task that contains the procedure LOAD_ETL_PARAMETERS. This procedure has 2 input parameters (Source and Table name); assign them a value here. It’s the only step that requires hard-coded values.
In the second step, you add the Integration Task (data flow), which receives the output of the previous step as an input parameter:
The data flow will receive the parameter in real time and filter the data for that specific date.
The third step saves the output of the Integration Task (data flow) execution into the control table. For that, we call the procedure PRC_MANAGE_ETL, which has 8 input parameters. The first 2 parameters (Source and Table name) are assigned manually; for the remaining ones you use the “previous output parameter”:
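A sketch of PRC_MANAGE_ETL might look like the following. Only the Source and Table name parameters are stated above; the other six (run metadata taken from the Integration Task's output) are assumptions for illustration, as is the ETL_CONTROL table it writes to:

```sql
-- Sketch only: records one execution of an extract in the control table.
-- Parameter names beyond P_SOURCE/P_TABLE are hypothetical.
CREATE OR REPLACE PROCEDURE PRC_MANAGE_ETL (
  P_SOURCE       IN VARCHAR2,
  P_TABLE        IN VARCHAR2,
  P_RUN_ID       IN VARCHAR2,
  P_STATUS       IN VARCHAR2,
  P_ROWS_READ    IN NUMBER,
  P_ROWS_WRITTEN IN NUMBER,
  P_RUN_START    IN TIMESTAMP,
  P_RUN_END      IN TIMESTAMP
) AS
BEGIN
  INSERT INTO ETL_CONTROL
    (SOURCE_NAME, TABLE_NAME, LOAD_DATE, ROWS_LOADED,
     STATUS, RUN_START, RUN_END)
  VALUES
    (P_SOURCE, P_TABLE, TRUNC(SYSDATE), P_ROWS_WRITTEN,
     P_STATUS, P_RUN_START, P_RUN_END);
  COMMIT;
END;
```

With this in place, the next run of LOAD_ETL_PARAMETERS for the same source/table pair picks up the new LOAD_DATE as its high-water mark.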
The last step loads the data into the Data Warehouse table.
I hope this helps. If you have any doubt, please reach out to me through LinkedIn or Medium.