LookUp, Apply Map, Group by, Select Distinct, Left Join: different techniques to build Flags

This post aims to document and collect the main methods I know to “flag” rows in a model’s table in relation to another, using a field created for that purpose and adding it to that table.

The reasons why we usually mark rows by using flags are generally performance and / or volume: in the first case can count, add or do some aggregation as a single table avoiding walking relationships that can be complex and in the second case only want to qualify the first entity according to the second entity, but no need to preserve the rest of the data, saving space. In short it is a very widely used technique to build a table of facts (fact tables) in a star pattern and some less used on a snow-flake model.

While there are various techniques the most important is to have perfectly clear the relationship between the two tables and its cardinality because of the latter depend on the set of techniques that can be applied without generating errors mirroring rows.

If the relationship between the tables is 1 to 0-or-1:

As an example take two tables: Employees and Spouses (table for husbands, wives, spouses in general):

Employees:
load * inline [
EmpId, EmpName
1,   Pedro
2,   Maria
3,   Susana
4,   John
5,   Carlos
6,  Sofia
];

Spouses:
Load * inline [
SpouseID, EmpId, Name
1, 1, Sonia
2, 2, Paul
3, 4, Gabriela
5, 6, Serge
];
 

What we seek is to make employees adding “Spouse” field (spouse) and assign the value Y / N as found or not a row in the table of “Spouse”.

Using “Left Join”:

Here the code:

LEFT JOIN (Employees)
LOAD
     EmpId,                                                  /// link field
     if(isNull(EmpId),’N’,’Y’) as Spouse
RESIDENT Spouses;

Drop Table Spouses;

And the result we get is: Cero

But Susana and Carlos instead of showing “N” (as would be the requirement of user) in the Spouse field show “-” (null). This is because the code “if (isNull (EmpId), ‘N’, ‘Y’)” is executed for each line of the table Spouse, where indeed there are no rows that match Carlos and Susana and EmpId obviously never becomes zero in said table.

For this we need to finalize the order makes a second pass, costly and inadvisable:
tmpEmployees:
Noconcatenate
Load
     EmpId,
     EmpName,
     if(isNull(Spouse),’Y’,’N’) as Spouse
RESIDENT Employees;

Drop table Employees;
Rename Table tmpEmployees to Employees;
 

Here we seek the fields where “Spouse” is null and change this value to “N”, obtaining the following result: quinta

Finally the use of “Left Join” could cause duplicate rows if in any case the ratio of 1-to-0/1 wasn’t so, then creating a difficult situation to detect between (sometimes) million records.

Using “Apply Map”:

This is a lot more recommended technique for performance, especially when working with large volumes, and require a single pass (in addition to building the map): 

mapSPOUSES:   //We build the map for each employee who has espouse and assign “Spouse=Y”.
MAPPING
LOAD 
     EmpId,
     ‘Y’ as Spouse
Resident Spouses;

Drop Table Spouses;

// Apply the map replacing EmpId field by what the map returns
// If EmpId is founded in the map put a “Y” if not put a “N”

tmpEmployees:
Load
     EmpId,
     EmpName,
     applymap(‘mapSPOUSES’,EmpId,’N’) as Spouse
RESIDENT Employees;

Drop Table Employees;
RENAME Table tmpEmployees to Employees;

 Thereby we obtain in a single pass: quinta

Usando “Exists”:

To use this technique (in this example) the first thing to do is rename the field EmpId to EmpId2 in table Spouses:

Spouses:
Load * inline [
SpouseID, EmpId2, Name ………

 Then the code:

Employees2:
NoConcatenate
Load
     EmpId,
     EmpName,
     ‘Y’ as Spouse
RESIDENT Employees
where EXISTS(EmpId2,EmpId);

What produces (only used in spousal relationship): cuarta

And here the complete code table with “N” part:

Employees2:
Load
     EmpId,
     EmpName,
     ‘N’ as Spouse
RESIDENT Employees
where NOT EXISTS(EmpId2,EmpId);

Drop Tables Spouses, Employees;
Rename Table Employees2 to Employees;

 What produce: quinta

Using “Lookup”:

Employees2:
NoConcatenate
Load
     EmpId,
     EmpName,
     if(isnull(lookup(‘SpouseID’, ‘EmpId’, EmpId, ‘Spouses’)),’N’,’Y’) as Spouse
RESIDENT Employees;

Drop Tables Spouses, Employees;

And get: quinta

 When the table’s volume is particularly large is highly advisable to use “Apply Map” for reasons of performance over any other technique, even lookup, though it shoud be mentioned that the “Exists” method is also highly performant (in my opinion even more than “apply map” ) just are quite restricted to those occasions when may be employed.

If the relationship between the tables is 1 to 0-1 or more rows:

For this we will change a little the design of the tables on which we will work, assembling tables 1-0 ratio-1-N:

Employees:
load * inline [
EmpId, EmpName
1,   Pedro
2,   María
3,   Susana
4,   John
5,   Carlos
6,  Pablo
];

Childs:
Load * inline [
ChildID, EmpId, Name
1, 1, Joseph
2, 2, Micaela
3, 2, Santino
4, 4, Mili
5, 5, Catty
6, 6, Sebastián
7, 6, Florencia
]; 

Using “Left Join” con Group By:

Here the idea is to get a single row in the Childs table for each Employee, adding some information, in this case by counting the number of children:

Left Join (Employees)
Load
     EmpId,
     Count(ChildID) as Childs
RESIDENT Childs
Group By EmpId;

So get: sexta
And if we want to convert null to 0 (zero), add:

tmp_Eployees:
Load
     EmpId,
     EmpName,
     if(isNull(Childs),0,Childs) as Childs2
RESIDENT Employees;

Drop Tables Childs, Employees;
RENAME Table tmp_Employees to Employees;

 With what we get: septima

Using “Left Join” with “Select Distinct”:

The idea here is to get a single row back to join with the Employees table and thus no duplicate rows. But here instead of by using Group (which has very poor performance in large volumes) will achieve a single row per employee as follows:

Left Join (Employees)
Load DISTINCT
     EmpId,
     if(not isNull(ChildID),’Y’) as Childs
RESIDENT Childs;

The limitation of this technique as opposed to the “Group By” is that we can’t receive quantities or other aggregations (eg:. Sum, concat, etc) but it limit ourselves to get some kind of flag with ALWAYS the same value, just for the DISTINCT let only one row per employee, obtaining: 

octava

And if we want to convert null to 0 (zero), add:

tmp_Eployees:
Load
     EmpId,
     EmpName,
     if(isNull(Childs),’N’,Childs) as Childs2
RESIDENT Employees;

Drop Tables Childs, Employees;
RENAME Table tmp_Employees to Employees;

Obtaining: novena

 Using “LookUp “

Employees2:
NoConcatenate
Load
     EmpId,
     EmpName,
     if(isnull(lookup(‘ChildID’, ‘EmpId’, EmpId, ‘Childs’)),’N’,’Y’) as Childs
RESIDENT Employees;

Drop Tables Childs, Employees;

And we get the desired result in one pass:

novena

Using “ Exists”:

We apply the same technique explained above:

First we change the name of the field EmpId (to EmpId2) in the table Childs:

 Childs:
Load * inline [
ChildID, EmpId2, Name …

 And then the code:

Employees2:
NoConcatenate
Load
     EmpId,
     EmpName,
     ‘Y’ as Childs
RESIDENT Employees
where EXISTS(EmpId2,EmpId);

Employees2:
Load
     EmpId,
     EmpName,
     ‘N’ as Childs
RESIDENT Employees
where NOT EXISTS(EmpId2,EmpId);

 Obtaining in two pass: 

novena

 At last using APPLY MAP:

We construct the map where each employee who has children and assign Child = Y. 

mapChilds:  

MAPPING
LOAD 
     EmpId,
     ‘Y’ as Child
Resident Childs;
Drop Table Childs;

And apply the map replacing EmpId field by what the map returns. If EmpId is found returns “Y” and if the map has not the relation  returns a “N”.

tmpEmployees:
Load
     EmpId,
     EmpName,
     applymap(‘mapChilds’,EmpId,’N’) as Child
RESIDENT Employees;

Drop Table Employees;
RENAME Table tmpEmployees to Employees;

Thus obtaining a single row for each employee because MAP APPLY only returns the first match for each row. In this sense it works like a Select DISTINCT:

novena

Acerca de pparnisari
Casi 44 vueltas al sol. Licenciado en sistemas. Curioso, investigador, excelente para resolver problemas prácticos. Casado, 2 hijos, cada vez con mas preguntas y respuestas mas en duda. Almost 44 laps to the sun. System engineer. Curious, researcher, great for solving practical problems. Married, 2 children, each time with more questions and more answers in doubt.

Deja un comentario

Introduce tus datos o haz clic en un icono para iniciar sesión:

Logo de WordPress.com

Estás comentando usando tu cuenta de WordPress.com. Cerrar sesión / Cambiar )

Imagen de Twitter

Estás comentando usando tu cuenta de Twitter. Cerrar sesión / Cambiar )

Foto de Facebook

Estás comentando usando tu cuenta de Facebook. Cerrar sesión / Cambiar )

Google+ photo

Estás comentando usando tu cuenta de Google+. Cerrar sesión / Cambiar )

Conectando a %s

A %d blogueros les gusta esto: