Incremental partitioned reload – Part I
11/27/2013 Deja un comentario
In this article I’ll explain how to make one (usually daily) periodic incremental load and provide the code to perform incremental loads partitioned by any numeric field.
Basically an incremental load works as follows:
This means that for the first time perform a full load and then just take each day’s news of the previous day and then add to these news everything that was already stored in the QVD and that is NOT on these news.
For example if a client (ID = 99) on day X is created on data source, and his address declared as “Route 66 Mi 456” that day will detect this row as a novelty, because it does not exist in the QVD, and therefore will be added to qvd. Later on X + N day customer moves to “Route 2 Mi 38” and the process of incremental reload will detect the novelty and will add the rest of the QVD (other customers), discarding the data which had kept to the client 99 (Route 66 Mi 456) and instead keeping customer ID=99 data that were recovered from the BD.
Achieve this requires that the table on which it will do incremental recharge meets two characteristics:
1- It must has a unique identifier.
2- It must have a field of type date or date-time where each time a record is created and / or modified in ANY way by the source system (transactional) this field is updated with the current date-time. This task is a commitment that must be agreed with the responsible source system.
To specify the explained and move forward here’s the code for a basic incremental recharge:
Set TimestampFormat=’YYYY-MM-DD hh:mm:ss[.fff]’; //Spanish format
LET v_CompareDate_1 = DATE(TODAY()-1,’YYYY-MM-DD’); // Yesterday
//Checks if the QVD file exists for the first time
IF NOT ISNULL(QvdCreateTime(‘incrementalTable.QVD’)) THEN
//It is not the first time, because the QVD file exists
//Recover all the news (new records + updates since last load (v_CompareDate_2)
LOAD *; //If prefer can detail a list of fields
SQL SELECT * //Can detail fields list
WHERE LastUpdateDate >= TO_DATE(‘$(v_CompareDate_2)’,’YYYY-MM-DD’);
//Adds records stored en prvious days into QVD file and doesn’t exist as news
FROM incrementalTable.QVD (QVD)
WHERE NOT EXISTS (UNIQUEID);
// first time running: the QVD file doesn’t exist
// So it makes a full load
SQL SELECT * FROM sourceTable;
STORE incrementalTable incrementalTable.QVD;
DROP TABLE incrementalTable;
LET v_CompareDate_2 = ‘$(v_CompareDate_1)’;
We will highlight some points of the code:
1) How to detect the latest news?
We do read all of BD what happened after the last time ran this same script: “WHERE LastUpdateDate> = TO_DATE (‘$ (v_CompareDate_2)’, ‘YYYY-MM-DD’)”. This could be yesterday or two weeks ago, it does not matter because the date of the last run (less one day) was stored in “v_CompareDate_2” variable (remember that QV variables are persistent over time).
2) How to avoid loading new rows that had been previously loaded (duplicity)?
If the QVD had rows stored already present among the news we will keep these news and discard what was stored: “WHERE NOT EXISTS (UNIQUEID)”
3) How the join the news with other previously stored data?
Concatenating two sets (news + “legacy” data): “CONCATENATE (incrementalTable)”
4) How “prepare” the script for the next run? Seting the value of “v_CompareDate_2” to “v_CompareDate_1”: “LET v_CompareDate_2 = ‘$ (v_CompareDate_1)'”
Thus we have a script able to load an entire table in his first run and after that just extract and store the news replacing the old versions of these records if any.
However although this technique optimizes BD reading and transfer of data it has a performance issue: if the resulting QVD is truly large (> 1 or 2 TByte) , every time it runs reads and saves a very large file to disk, so even with fairly powerful servers it can take on the order of 10-20 minutes, doing heavy execution and consuming valuable time of an operating window that in many organizations is often narrowed to a few dawn hours.