Incremental partitioned reload – Part I

In this article I’ll explain how to make one (usually daily) periodic incremental load and provide the code to perform incremental loads partitioned by any numeric field.

Basically an incremental load works as follows:

incrementalLoad

 This means that for the first time perform a full load and then just take each day’s  news of the previous day and then add to these news everything that was already stored in the QVD and that is NOT on these news.

For example if a client (ID = 99) on day X is created on data source, and his address declared as “Route 66 Mi 456” that day will detect this row as a novelty, because it does not exist in the QVD, and therefore will be added to qvd. Later on X + N day customer moves to “Route 2 Mi 38” and the process of incremental reload will detect the novelty and will add the rest of the QVD (other customers), discarding the data which had kept to the client 99 (Route 66 Mi 456) and instead keeping customer ID=99 data  that were recovered from the BD.

Achieve this requires that the table on which it will do incremental recharge meets two characteristics:

1-      It must has a unique identifier.

2-      It must have a field of type date or date-time where each time a record is created and / or modified in ANY way by the source system (transactional) this field is updated with the current date-time. This task is a commitment that must be agreed with the responsible source system.

To specify the explained and move forward here’s the code for a basic incremental recharge:

Set TimestampFormat=’YYYY-MM-DD hh:mm:ss[.fff]’; //Spanish format

LET v_CompareDate_1 = DATE(TODAY()-1,’YYYY-MM-DD’); // Yesterday

//Checks if the QVD file exists for the first time
IF NOT ISNULL(QvdCreateTime(‘incrementalTable.QVD’)) THEN
//It is not the first time, because the QVD file exists

//Recover all the news (new records + updates since last load (v_CompareDate_2)

incrementalTable:
LOAD *; //If prefer can detail a list of fields
SQL SELECT *  //Can detail fields list
FROM sourceTable
WHERE LastUpdateDate >= TO_DATE(‘$(v_CompareDate_2)’,’YYYY-MM-DD’);

//Adds records stored en prvious days into QVD file and doesn’t exist as news
CONCATENATE (incrementalTable)
LOAD *
FROM incrementalTable.QVD (QVD)
WHERE NOT EXISTS (UNIQUEID);

ELSE
// first time running: the QVD file doesn’t exist

// So it makes a full load

incrementalTable:
LOAD *;
SQL SELECT * FROM sourceTable;

END IF

STORE incrementalTable incrementalTable.QVD;
DROP TABLE incrementalTable;

LET v_CompareDate_2 = ‘$(v_CompareDate_1)’;
 

We will highlight some points of the code:

1) How to detect the latest news?

We do read all of BD what happened after the last time ran this same script: “WHERE LastUpdateDate> = TO_DATE (‘$ (v_CompareDate_2)’, ‘YYYY-MM-DD’)”. This could be yesterday or two weeks ago, it does not matter because the date of the last run (less one day) was stored in “v_CompareDate_2” variable (remember that QV variables are persistent over time).

2) How to avoid loading new rows that had been previously loaded (duplicity)?

If the QVD had rows stored already present among the news we will keep these news and discard what was stored: “WHERE NOT EXISTS (UNIQUEID)”

3) How the join the news with other previously stored data?

Concatenating two sets (news + “legacy” data): “CONCATENATE (incrementalTable)”

4) How “prepare” the script for the next run? Seting the value of  “v_CompareDate_2” to “v_CompareDate_1”:   “LET v_CompareDate_2 = ‘$ (v_CompareDate_1)'”

Thus we have a script able to load an entire table in his first run and after that just extract and store the news replacing the old versions of these records if any.

However although this technique optimizes BD reading and transfer of data it has a performance issue:  if the resulting QVD is truly large (> 1 or 2 TByte) ,  every time it runs reads and saves a very large file to disk, so even with fairly powerful servers it can take on the order of 10-20 minutes, doing heavy execution and consuming valuable time of an operating window that in many organizations is often narrowed to a few dawn hours.

 In the second part of this article I will explain how to modify and package this code to read and write only small portions of QVD (actually many small QVDs), optimizing the time and resource disk access.

Acerca de pparnisari
Casi 44 vueltas al sol. Licenciado en sistemas. Curioso, investigador, excelente para resolver problemas prácticos. Casado, 2 hijos, cada vez con mas preguntas y respuestas mas en duda. Almost 44 laps to the sun. System engineer. Curious, researcher, great for solving practical problems. Married, 2 children, each time with more questions and more answers in doubt.

Deja un comentario

Introduce tus datos o haz clic en un icono para iniciar sesión:

Logo de WordPress.com

Estás comentando usando tu cuenta de WordPress.com. Cerrar sesión / Cambiar )

Imagen de Twitter

Estás comentando usando tu cuenta de Twitter. Cerrar sesión / Cambiar )

Foto de Facebook

Estás comentando usando tu cuenta de Facebook. Cerrar sesión / Cambiar )

Google+ photo

Estás comentando usando tu cuenta de Google+. Cerrar sesión / Cambiar )

Conectando a %s

A %d blogueros les gusta esto: