How to implement file-based filtering screens in Perl?
I am having to recode my ETL for performance reasons, and I'm trying to figure out how to reimplement some features that were easy when the data was staged in the database rather than in files. The original flow queried the source and loaded the results into a staging table in the database. From there, a screens table held all the SQL screens for error checking, and Perl iterated over that table and ran each screen's SQL. Complex screens were easy because the lookup tables and dimensions were all in the database alongside the staged data, and since the screen SQL lives in a table, I can add or modify screens without touching the ETL source code. But for performance and restartability reasons I am moving to a file-based approach, and I'm trying to figure out how to recreate this. What I tried was changing the SQL in the screens table to Perl code that can be run through an eval statement, as shown in the before/after below.
Before:
- Code:
# example sql screen
# SELECT error_tracking_pk FROM staging_table WHERE column_1 IS NULL
my ($screen_sql, $screen_pk);
my $sth = $dbh->prepare("SELECT screen_sql, screen_pk FROM screens");
$sth->execute();
$sth->bind_col(1, \$screen_sql);
$sth->bind_col(2, \$screen_pk);
while ($sth->fetch())
{
    # run each screen's SQL as a subquery; offending rows land in the error table
    my $sth2 = $dbh->prepare("INSERT INTO error_table SELECT error_tracking_pk, $screen_pk, SYSDATE FROM ($screen_sql)");
    $sth2->execute();
}
After:
- Code:
# example perl screen
# not(defined $data_hash_ref->{column_1}) && $data_hash_ref->{column_2} >= 4
my ($screen_perl, $screen_pk);
my $sth = $dbh->prepare("SELECT screen_perl, screen_pk FROM screens");
$sth->execute();
$sth->bind_col(1, \$screen_perl);
$sth->bind_col(2, \$screen_pk);
my @screens;
while ($sth->fetch())
{
    # copy the bound values; each screen is [screen_perl, screen_pk]
    push @screens, [$screen_perl, $screen_pk];
}
while ( ... iterating over the csv records into $data_hash_ref ... )
{
    foreach my $screen (@screens)
    {
        # $screen->[0] is the screen_perl snippet, $screen->[1] its screen_pk
        if ( eval $screen->[0] )
        {
            # insert this record into the error table, tagged with $screen->[1]
        }
    }
}
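One wrinkle worth flagging with the string eval: a screen snippet that dies or fails to compile just returns undef and sets $@, which looks exactly like "no error found". A small guard inside the foreach loop (sketch only, replacing the bare if (eval ...) test above) keeps bad screens from passing silently:
- Code:
# distinguish "screen passed" from "screen blew up"
my $result = eval $screen->[0];
if ($@)
{
    # the snippet died or failed to compile; log it rather than
    # silently treating the record as clean
    warn "screen $screen->[1] failed: $@";
}
elsif ($result)
{
    # screen fired: insert this record into the error table
}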
This method works well for screens that only deal with data in the CSV file, but if I want to check whether a value exists in a lookup table, I now have to modify the ETL script to pull the lookup into an array that the screen_perl code can reference. I'm looking for something a little more elegant. I also looked at storing AWK code in the table instead, but the CSV data contains clobs with embedded line breaks and commas that I can't get AWK to parse. How have others done filtering and data cleaning with file-based staging data? The only other option I can think of is to load the data into the database, run the screens, and then dump the clean records back out for further processing.
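The closest I've come to something elegant is the sketch below: keep a driver table that lists every lookup the screens need, preload each one into a shared %lookup hash at startup, and let the screen_perl snippets test membership against it. The screen_lookups table, the %lookup hash, and the column names here are all made up for illustration. As a bonus, Text::CSV with binary => 1 handles the embedded line breaks and commas in the clobs that AWK choked on:
- Code:
use strict;
use warnings;
use DBI;
use Text::CSV;

# connection details are placeholders
my $dbh = DBI->connect('dbi:Oracle:orcl', 'user', 'password', { RaiseError => 1 });

# screen_lookups is a hypothetical driver table listing each lookup table
# and its key column, so adding a lookup is a row insert, not a code change
my %lookup;
my $sth = $dbh->prepare("SELECT table_name, key_column FROM screen_lookups");
$sth->execute();
while ( my ($table, $key_col) = $sth->fetchrow_array() )
{
    my $keys = $dbh->selectcol_arrayref("SELECT $key_col FROM $table");
    $lookup{$table} = { map { $_ => 1 } @$keys };
}

# a screen_perl snippet can now test membership without ETL changes, e.g.:
#   !exists $lookup{product_dim}{ $data_hash_ref->{product_code} }

# binary => 1 lets Text::CSV parse fields containing newlines and commas
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
open my $fh, '<', 'staging.csv' or die "staging.csv: $!";
$csv->column_names( $csv->getline($fh) );    # first row is the header
while ( my $data_hash_ref = $csv->getline_hr($fh) )
{
    # ... run the eval'd screens against $data_hash_ref as above ...
}
close $fh;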