Incremental partitioned reload – Part II

In the first part of this article we saw what an incremental load is and how it is done, using a technique that covers most of the situations we will encounter.

But, as I said, there are cases where the volume of data being handled is so large (on the order of terabytes) that simply reading and/or writing the QVD file involved can take a considerable amount of time and server resources, when perhaps all we need is to add a few new records to it.

To avoid this situation, we should split this file into several fragments, each containing a set of rows grouped by some criterion. One approach that is often used is the month-year of one of the record's fields, thereby generating a QVD repository that looks something like this:

                                 Extraction_Customers_CreateDate_012013.QVD
                                 Extraction_Customers_CreateDate_022013.QVD
                                 Extraction_Customers_CreateDate_032013.QVD
                                 etc.

Here a fixed part of the file name describes the content/source/stage, and a variable part describes the period (month-year) of the records in each file, in this case the customer's creation date.

Although we cannot always count on a date, we can always count on a numeric field by which to classify and group records: either the Id field or another unique identifier that can be created using the Autonumber() or Hash…() functions.
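As a minimal sketch (the table and field names here are hypothetical), a numeric surrogate key could be derived like this. Bear in mind that Autonumber() only guarantees consistent values within a single reload, so a key generated this way should be persisted if it must remain stable across executions:

Customers:
LOAD *,
     Autonumber(CUSTOMER_CODE) as ID_NUM;   // sequential integer per distinct code
SQL SELECT * FROM CUSTOMERS;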

The important thing here is that the field you choose must meet the following characteristics:

      – MUST be numeric: it is used to define the ranges or segments.
      – Must NEVER change value over time: so that we can always locate the QVD file in which a record is stored.
      – Should have a statistical distribution close to random (i.e. the class frequencies should tend to be equal).
      – As a consequence of the above, it should not produce empty ranges: using the customer's creation date example, this means that customers should be created EVERY month, in a continuous way.

The last two points are not mandatory, but rather a recommendation for best performance.
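To check the third point in practice, a minimal sketch like the following (it assumes a Customers table already loaded with an IDCUSTOMER field, and a chunk size of 500) counts the rows that would fall into each partition, making uneven classes easy to spot:

PartitionCheck:
LOAD floor(IDCUSTOMER/500) as TARGET_QVD,
     count(IDCUSTOMER) as RowsInPartition
RESIDENT Customers
GROUP BY floor(IDCUSTOMER/500);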

In general, the same unique identifier used to perform the incremental load is an excellent (if not the best) candidate field to partition the QVD, because at the source (transactional) system it is usually numeric, time-invariant (a PK) and auto-incremented (1 by 1).
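For example, with a chunk size of 500, a customer with Id 1234 falls into partition floor(1234/500) = 2, that is, the file covering Ids 1000 to 1499.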

To improve the readability of the code we will use procedures (Subs), which operate similarly to functions or routines in other languages. Here is the commented code of the procedure, which should be saved in a file called “incrementalPartitionedReload.txt” located in the same folder as the qvw we are working on:
 
sub loadIncrementalPartitioned (V_chunk, V_KEY, V_QVDname, V_SQLquery, V_incrementalWhere, V_fieldsLoaded, V_fieldsTransformations)

    // note: the _QVDsPath variable is taken from the calling script
    SET TimestampFormat='YYYY-MM-DD hh:mm:ss[.fff]';

    LET TABLA_SOURCE='DATA_SOURCE';

    // set the variable used to recover ONLY the news (since the last execution)
    LET v_CompareDate_1 = DATE(TODAY()-1,'YYYY-MM-DD');

    // check whether the first full load has already been done
    IF NOT ISNULL(QvdCreateTime('$(_QVDsPath)$(V_QVDname)_incrementalReload_.QVD')) THEN

        // extract only the news from the data source
        pre_$(TABLA_SOURCE):
        LOAD $(V_fieldsLoaded);
        SQL $(V_SQLquery)
        $(V_incrementalWhere);

        // for each row calculate the corresponding partition and apply transformations (if any)
        $(TABLA_SOURCE):
        NoConcatenate
        LOAD *,
             floor($(V_KEY)/$(V_chunk)) as TARGET_QVD
             $(V_fieldsTransformations)
        RESIDENT pre_$(TABLA_SOURCE);
        DROP TABLE pre_$(TABLA_SOURCE);

        // fill a little table used to build the names of the affected QVD files
        // (if they already exist), each one containing up to V_chunk rows
        AffectedQVDs:
        LOAD DISTINCT TARGET_QVD
        RESIDENT $(TABLA_SOURCE);

        // get the quantity of QVDs that will be touched
        LET rows = NoOfRows('AffectedQVDs');

        FOR row=1 to rows
            // assemble the variable part of the QVD file name: [lowest Id;highest Id]
            LET txtQVD = '[' & (FieldValue('TARGET_QVD',row) * V_chunk) & ';' & ((FieldValue('TARGET_QVD',row)+1) * V_chunk - 1) & ']';

            // if the QVD file already exists, add the news to it, discarding the old versions of the same records
            IF NOT ISNULL(QvdCreateTime('$(_QVDsPath)$(V_QVDname)=$(txtQVD).QVD')) THEN
                CONCATENATE ($(TABLA_SOURCE))
                LOAD *
                FROM $(_QVDsPath)$(V_QVDname)=$(txtQVD).QVD (QVD)
                WHERE NOT EXISTS ($(V_KEY));
            END IF
        NEXT

        DROP TABLE AffectedQVDs;    // tidy up the helper table

    ELSE
        // it is the first run of this script: a full SQL extraction is necessary
        pre_$(TABLA_SOURCE):
        LOAD $(V_fieldsLoaded);
        SQL $(V_SQLquery);

        // for each row calculate the corresponding partition and apply field transformations (if any)
        $(TABLA_SOURCE):
        NoConcatenate
        LOAD *,
             floor($(V_KEY)/$(V_chunk)) as TARGET_QVD
             $(V_fieldsTransformations)
        RESIDENT pre_$(TABLA_SOURCE);
        DROP TABLE pre_$(TABLA_SOURCE);

        // store a dummy file just to signal that the first full load has been done
        flagFile:
        NoConcatenate
        LOAD 1 as dummy
        AutoGenerate 1;
        STORE flagFile INTO $(_QVDsPath)$(V_QVDname)_incrementalReload_.QVD;
        DROP TABLE flagFile;

    END IF

    // recover the max and min values of the field used to partition
    MAXI_MINI:
    LOAD min($(V_KEY)) as minPK,
         max($(V_KEY)) as maxPK
    RESIDENT $(TABLA_SOURCE);

    LET maxPK=round(Peek('maxPK'));
    LET minPK=round(Peek('minPK'));
    DROP TABLE MAXI_MINI;    // tidy up the helper table

    // calculate the number of partitions
    // (+1 so that a maxPK exactly divisible by V_chunk still falls inside the last partition)
    LET partitions=floor(maxPK/V_chunk)+1;

    LET V_I=1;
    do while V_I <= partitions
        if (V_I*V_chunk >= minPK) then
            // extract the chunk belonging to this partition
            toStore:
            NoConcatenate
            LOAD *
            RESIDENT $(TABLA_SOURCE)
            WHERE $(V_KEY) >= ($(V_I)-1)*$(V_chunk) and $(V_KEY) < $(V_I)*$(V_chunk);

            LET q = NoOfRows('toStore');

            // verify that the partition is not empty
            IF q > 0 then
                LET since=(V_I-1)*V_chunk;
                LET upto=(V_I*V_chunk)-1;
                // store the partition, using an assembled file name that describes its content
                STORE toStore INTO $(_QVDsPath)$(V_QVDname)=[$(since);$(upto)].QVD;
                DROP TABLE toStore;
            END IF
        END IF
        LET V_I=V_I+1;
    loop

    drop table $(TABLA_SOURCE);

    // prepare the variable that will detect the news in the next run
    LET v_CompareDate_2 = '$(v_CompareDate_1)';
end Sub

Finally, we complete our extractor in the qvw script, in this case for the Customers table:

 //////////////LIBRARIES/////////////////////////
// Load library
$(Include=.\incrementalPartitionedReload.txt);
//////////////END LIBRARIES////////////////////

CONNECT TO […YOUR DB CONNECTION STRING….];
// Quantity of Ids in each QVD partition; change it to build bigger partitions and therefore fewer QVDs
LET _chunk = 500;
// Unique Id used for partitioning and inserting the news
LET _Key='IDCUSTOMER';
// First part of the QVDs file name
LET _QVDname='Extract_CUSTOMER_' & _Key;
// Extraction SQL query
SET _SQLquery=SELECT * FROM CUSTOMERS;
// WHERE clause which detects the news (inserts + updates)
SET _incrementalWhere= WHERE LASTUPDATE >= TO_DATE('$(v_CompareDate_2)','YYYY-MM-DD');
// Data source fields read and loaded into QV
SET _FieldsLoaded= IDCUSTOMER, NAME, STREET, NUMBER, CITY, LASTUPDATE;
// Optional: field transformations. Could be empty, but if not empty it must begin with ","
SET _transformFields= ,CITY & ',' & STREET & ' ' & NUMBER AS FULL_ADDRESS;
// QVDs path
SET _QVDsPath =..\Qvd\;
// call the incremental partitioned load:
call loadIncrementalPartitioned(_chunk, _Key, _QVDname, _SQLquery, _incrementalWhere, _FieldsLoaded, _transformFields);
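Once the flag file exists, these values make the extraction inside the sub expand to roughly the following (the date comes from v_CompareDate_2, set at the end of the previous run; the value shown here is just a placeholder):

pre_DATA_SOURCE:
LOAD IDCUSTOMER, NAME, STREET, NUMBER, CITY, LASTUPDATE;
SQL SELECT * FROM CUSTOMERS
WHERE LASTUPDATE >= TO_DATE('2013-05-20','YYYY-MM-DD');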

 

By substituting the values of the variables _chunk, _Key, _QVDname, _SQLquery, _incrementalWhere, _FieldsLoaded, _QVDsPath and _transformFields, we can easily adapt this and create partitioned incremental load scripts.

For example, with _chunk = 500 we get a repository that looks something like this:

                                 Extract_CUSTOMER_IDCUSTOMER=[0;499].QVD
                                 Extract_CUSTOMER_IDCUSTOMER=[500;999].QVD
                                 Extract_CUSTOMER_IDCUSTOMER=[1000;1499].QVD
                                 etc.

Last considerations:

 – The first time the script is run, in addition to doing a full extraction of the data source and saving the QVDs with data, it saves a “dummy” QVD file (in our example, Extract_CUSTOMER_IDCUSTOMER_incrementalReload_.QVD) which is only used to signal (flag) that the first run has already been done. If this file is deleted, the script will repeat the full reload on its next execution.

 – The _chunk variable determines the number of rows in each QVD and therefore the quantity of QVDs. The right value depends on the number of rows in the table and on whether there are “holes” (discontinuities) in the numbering.

– If we do not want to partition the table but still want to keep doing incremental reloads, we just need to define a _chunk value large enough for all the Ids to fit into the single partition that will be created (e.g. _chunk = 1000000000 if the Ids stay below one billion).

 – Finally, to read and use the QVDs created in the extraction layer from our intermediate layer (or presentation layer, depending on the application architecture), we just make use of the wildcard “*” to load all the parts as if they were one. E.g.:

 LOAD *
FROM
[D:\Ventas\Qvd\Extract_CUSTOMER_IDCUSTOMER=[*].QVD]
(qvd);
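And since each file name encodes the Id range it contains, a single fragment can also be read directly when only a known slice of Ids is needed. A sketch following the same naming scheme:

 LOAD *
FROM
[D:\Ventas\Qvd\Extract_CUSTOMER_IDCUSTOMER=[1000;1499].QVD]
(qvd);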
