Incremental partitioned reload – Part II
11/29/2013 Deja un comentario
In the first part of this article we saw what it was an incremental load and how it’s done using a technique that meets most of the situations that we can find.
But as I said there are cases where the volume of data being handled is so large (on the order of terabytes) than simply reading and / or recording involved QVD file can take a considerable amount of time and server resources, being that maybe we just need add a few news to it.
To avoid this situation what we should do is to split this file into several fragments which each contain a set of rows grouped by some criteria. One approach that is often used is the month-year of a record’s field, thereby generating QVDs repositories that looks something like this:Extraction_Customers_CreateDate_012013.QVD Extraction_Customers_CreateDate_022013.QVD Extraction_Customers_CreateDate_032013.QVD Etc….
Where a fixed part of the file name describes the content/source/stage and a variable part describes the period (month-year) corresponding to the records in each file, in this case the customer’s creation Date.
Although we can not always have a date we can always count on a numeric field by which to classify and group records, either the Id field or other unique identifier that can be created using the Autonumber() or Hash…() functions.
The important thing here is that the field you choose must meet the following characteristics:– MUST be numeric: to define ranges or segments.
– NEVER change value over time: in order to locate in what QVD file is stored as a record.
– Should have a statistical distribution close to random (ie the class frequencies should tend to be equal)
– As a consequence of this should not have empty ranges: using the example customer’s creation date this means that customers should be created EVERY month in a continuous way.
The last two points are not mandatory, but rather a recommendation for best performance.
In general the same unique identifier used to perform incremental load is an excellent (if not the best) candidate field to partition the QVD, because at the source (transactional) system it is usually numeric, time-invariant (PK), and auto-increment (1 to 1).
To improve readability of the code we will use procedures (Subs) operating similar to a function or routine other languages. Here’s the code of the procedure commented which should be saved in a file called “incrementalPartitionedReload.txt” that must be in the same folder as the qvw on which we are working:
sub loadIncrementalPartitioned (V_chunck,V_KEY,V_QVDname,V_SQLquery,V_incrementalWhere,V_fieldsLoaded,V_fieldsTransformations)
// set the variable used to recover ONLY the news (since last execution)
LET v_CompareDate_1 = DATE(TODAY()-1,’YYYY-MM-DD’);
// Check if the first full load has been done
IF NOT ISNULL(QvdCreateTime(‘$(_QVDsPath)$(V_QVDname)_incrementalReload_.QVD’)) THEN
//Extract data from datasource
//for each row calculate the corresponding patition and make transformations (if any)
floor($(V_KEY)/$(V_chunck)) as TARGET_QVD
DROP TABLE pre_$(TABLA_SOURCE); //Fill a little table which will be used to build the affected (if already exist) QVDs files names, each one containing upto V_chunk rows
LOAD DISTINCT TARGET_QVD
// get the quantity of QVDs that will be used
LET rows = NoOfRows(‘AffectedQVDs’);
FOR row=1 to rows
// assembles QVD file name
let txtQVD ='[‘ & (FieldValue(‘TARGET_QVD’,row) * V_chunck) &’;’& ((FieldValue(‘TARGET_QVD’,row)+1) * V_chunck – 1) & ‘]’;
// If the QVD file already exists add the news to this, discarding the old versions of same records
IF NOT ISNULL(QvdCreateTime(‘$(_QVDsPath)$(V_QVDname)=$(txtQVD).QVD’)) THEN
FROM $(_QVDsPath)$(V_QVDname)=$(txtQVD).QVD (QVD)
WHERE NOT EXISTS (PRIMARY_KEY);
// Is the first run of this script. It’s necessary to do a full SQL extraction
//for each row calculate the corresponding patition and make field transformations (if any)
floor($(V_KEY)/$(V_chunck)) as TARGET_QVD
DROP TABLE pre_$(TABLA_SOURCE);
//will store a dummy file just for signal that the first full load has been done
LOAD 1 as dummy
STORE flagFile INTO $(_QVDsPath)$(V_QVDname)_incrementalReload_.QVD;
DROP TABLE flagFile;
//Recover the max and min value of the field used to partition
LOAD min($(V_KEY)) as minPK,
max($(V_KEY)) as maxPK
// calculates the number of partitions
do while V_I<= partitions
if (V_I*V_chunck>=minPK) then
// extract the chunk for each partition
WHERE $(V_KEY)>= ($(V_I)-1)*$(V_chunck) and $(V_KEY)<$(V_I)*$(V_chunck);
LET q = NoOfRows(‘toStore’);
//verify if the partition is not empty
IF q>0 then
Let since=(V_I-1)*V_chunck ;
// store the partition, using an assembled file name with information about the content
STORE toStore INTO $(_QVDsPath)$(V_QVDname)=[$(since);$(upto)].QVD;
DROP TABLE toStore;
drop table $(TABLA_SOURCE);
// prepare variable to detect the news in the next run
LET v_CompareDate_2 = ‘$(v_CompareDate_1)’; end Sub
and finally we complete in the qvw script our extractor, in this case for the Customers table://////////////LIBRARIES/////////////////////////
// Load library
$(Include=.\incrementalPartitionedReload.txt); //////////////END LIBRARIES////////////////////
CONNECT TO […YOUR DB CONNECTION STRING….]; // Quantity of IDs in each QVD’s partition, change to build bigger partitions and therefore more or less QVDs LET _chunk = 500; // Unique ID used to patitioning and insert the news LET _Key=’IDCUSTOMER’; // First part of QVDs file name
LET _QVDname=’Extract_CUSTOMER_’ & _Key; // Extraction SQL query SET _SQLquery=SELECT * FROM CUSTOMERS; // WHERE clause which detects the news (inserts + updates)
SET _incrementalWhere= WHERE LASTUPDATE >= TO_DATE(‘$(v_CompareDate_2)’,’YYYY-MM-DD’); // Datasource fields read and load to QV
SET _FieldsLoaded= IDCUSTOMER, NAME, STREET, NUMBER, CITY, LASTUPDATE; // Optional: filds tranformations
SET _transformFields= ,CITY & ‘,’ & STREET & ‘ ‘ & NUMBER AS FULL_ADDRESS; // Could be empty – Optional, but if it is must begin with “,” // QVDs path
SET _QVDsPath =..\Qvd\; // call the incremental partitioned load:
call loadIncrementalPartitioned(_chunk,_Key, _QVDname, _SQLquery, _incrementalWhere, _FieldsLoaded, _transformFields);
Substituting the values of the variables _chunk, _Key, _QVDname, _SQLquery, _incrementalWhere, _FieldsLoaded, _QVDsPath and _transformFields can easily adapt and create incremental scripts to load partitioned.
For example, with _chunk = 500 we get something like this:
– The first time the script is run in addition to a full extraction of the datasource and save QVDs with data, it saves a “dummy” QVD file which is only used to indicate (flag) that the first run was already done. If this file is deleted the script will return to repeat a full reloading.
– The _chunk variable determines the number of rows in each QVD and QVDs quantity. The number varies depending on the number of rows in the table and if there are “holes” (discontinuities) in the numbering.
– If we do not want partition the table but still keep doing incremental reloading must only define a value _chunk large enough so that all Ids enter into the single partition that will be created.
– Finally to read and use QVDs created in the layer extraction in our intermediate layer (or presentation depending on the application architecture) we just make use of wildcar “*” to load all the parts as if they were one. P.Ej: