ETL and Data Quality

Lookup Error (Spelling mistake causing 90 record losses)

liamptk — Mon, 28 Mar 2016 06:42:04 GMT

I'm performing a lookup on a primary key, but one of the values is spelt differently, which causes 90 records to error. The simple solution is obviously to correct the spelling in the source document. However, this is an assignment, so I'm curious whether there is another (more professional) fix within the SSIS project.

Any suggestions?

Regards,
Liamptk
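
One way to handle this inside the package, rather than editing the source, is to clean the key before the lookup and redirect anything that still fails to an error output for review. The logic can be sketched in Python as below. This is only an illustration of the idea, not SSIS code, and the function names, the alias table, and the sample values are all hypothetical.

```python
def normalize(key):
    """Trim, lower-case, and collapse internal whitespace in a lookup key."""
    return " ".join(key.strip().lower().split())


def lookup_with_aliases(source_keys, dimension, aliases):
    """Match source keys against a dimension after cleaning them.

    dimension: dict of normalized natural key -> surrogate key (hypothetical)
    aliases:   dict of known misspelling -> correct normalized key (hypothetical)
    Returns (matched, unmatched); unmatched rows are kept, not dropped.
    """
    matched, unmatched = [], []
    for raw in source_keys:
        key = normalize(raw)
        key = aliases.get(key, key)  # apply a known spelling correction, if any
        if key in dimension:
            matched.append((raw, dimension[key]))
        else:
            unmatched.append(raw)  # route to an error table for review
    return matched, unmatched


# Hypothetical sample values, not taken from the assignment data.
dimension = {"gray": 1}
aliases = {"grey": "gray"}
matched, unmatched = lookup_with_aliases(["Grey ", "gray", "blue"], dimension, aliases)
```

In SSIS terms this roughly corresponds to a Derived Column (or a small mapping table joined before the Lookup) plus the Lookup's no-match output, so unmatched rows are kept for inspection instead of being lost.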

Estimating the approximate size of the data mart

liamptk — Mon, 14 Mar 2016 19:04:55 GMT

If there are 250,000 sales records in a data mart, how would you estimate the approximate size of the data mart and each of its tables?

Regards,
Liamptk
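
A common back-of-the-envelope approach is row count times estimated bytes per row, plus some allowance for indexes and storage overhead. The sketch below shows the arithmetic in Python. The column names, column widths, and the 30% overhead figure are assumptions chosen for illustration, not values from the question, and real sizes depend on the database engine.

```python
def estimate_table_bytes(row_count, column_bytes, overhead_pct=30):
    """Rough table size: rows x bytes per row, inflated by an overhead percentage.

    column_bytes: dict of column name -> assumed width in bytes
    overhead_pct: assumed allowance for indexes and page overhead
    """
    row_bytes = sum(column_bytes.values())
    return row_count * row_bytes * (100 + overhead_pct) // 100


# Hypothetical sales fact layout: four 4-byte keys/measures and one 8-byte amount.
sales_fact_columns = {
    "date_key": 4,
    "product_key": 4,
    "customer_key": 4,
    "quantity": 4,
    "amount": 8,
}

fact_bytes = estimate_table_bytes(250_000, sales_fact_columns)
print(f"Sales fact: about {fact_bytes / 1_000_000:.1f} MB")
```

The same function can be applied to each dimension table with its own assumed row count and column widths, and the results summed for the data mart as a whole.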

SCD Type 2 dimensions and facts

rajeshwarr59 — Sat, 05 Mar 2016 01:13:09 GMT

I just started on DW/ETL and need some clarification on implementing SCD type 2 dimensions and loading the fact tables. I am working on developing ETL for a few dimension and fact tables, and the table structure looks like this:

Product Dim: Prod_Key(SK), prod_num(natural key), category, name etc, eff_dt, expir_dt, cur_row_ind
Sales Fact: Sales_key(SK), cust_key(fk), product_key(fk), ord_key(fk), date_key(fk), quantity sold etc

The dimensions that I have are all type ...

Migrating the code from dev to test and other environments

rajeshwarr59 — Thu, 03 Mar 2016 06:06:01 GMT

Dear Experts,

I am trying to get some information on how you typically handle the migration of code from one environment to another. Right now, we do manual exports and imports from dev to QA and from QA to production.

Does anyone have any experience or suggestions on how this can be automated? Are you aware of any helpful links that provide detailed information on this topic?

Thanks in advance for the help.

Data Architecture for Single Source System (Normalised).

MarkW — Thu, 18 Feb 2016 15:18:50 GMT

Hi, I am working on a project to implement a reporting platform, consisting of what should be a hosted data warehouse through a supplier/agency. They are also developing the data architecture while the project is under way. When the reporting platform goes live, I will be maintaining, supporting, and developing the platform, with a view to expanding our in-house function into a team of BI developers. The source system is accessible in the following ways based on the supplier's proposal for the three ...

Need suggestion on which approach to use

ppdatawarehouse — Wed, 03 Feb 2016 21:58:15 GMT

I need suggestions on which approach to use in terms of tools and technologies. We have a TERADATA DB and a VERTICA DB, and we need to join a few tables from both DBs and store the result in VERTICA. The joins are complex and we have many data transformations to do. The data volume is very high, in the TB range per hour. Please suggest the best tool to use here.

ETL Execution & Scheduling

ETLDev — Wed, 03 Feb 2016 19:18:56 GMT

I had a question about the ETL design. Given that ETL would involve tools/scripts/programs written in different programming languages and platforms, what platforms/tools are used to manage the execution and scheduling of these? How do we make sure that these tools/scripts/programs run at the same time and pull data into the data warehouse?

Thanks,
Amit

DATA MAPPING

kball15 — Wed, 03 Feb 2016 07:35:15 GMT

Please can someone help me with a sample data mapping document (dimensions and facts)?
My task is for an existing data warehouse: I am to develop something like a data dictionary of the dimensions and facts in terms of how their attributes are derived. I have looked online and have seen some basic table-type layouts. I am just checking if there is a better standard template.

Thank you

Current vs history view of datamarts

rajeshwarr59 — Wed, 06 Jan 2016 02:53:45 GMT

Hello,

I am pretty new to data warehousing. I have been going through the Kimball forums and noticed a few posts about current view vs. history view, and I am curious as to what the difference between the two is. I feel like I kind of get this, but I get mixed up when I start going into details. Could someone help explain the difference between these two concepts with an example, since that helps me understand more clearly?

Thanks in advance.

ETL incremental load strategies

juanvg1972 — Wed, 30 Dec 2015 18:36:57 GMT

Hi, I would like to know which is the best strategy for an ETL incremental load. I have a CDC system that captures changes, which I read, validate, transform and load into the target DB. These changes belong to a period; all the rows that I process correspond to a month (for example, in December I process November rows, and in January I will process December rows). For example, in December I am processing November rows, and for the loading I have two options: 1) In every ETL incremental load delete all ...

Data Stores on Separate Databases within DW.

MarkW — Fri, 04 Dec 2015 20:05:51 GMT

Is there an argument to have separate databases within a data warehouse for the following data stores: Staging, Normalised Data Store, Operational Data Store, Dimensional Data Store? Conversely, is there an argument to mix your operational data store with your dimensional data store, for example? So, you would end up with a database with a mix of tables you would normally expect to find in an ODS with fact and dimension tables, which you would expect to find in a DDS? Our DW is hosted ...

Difference between NDS and ODS.

MarkW — Fri, 27 Nov 2015 20:54:00 GMT

Hi, I am a little unclear as to the difference between a normalised data store and an operational data store. I thought that the main difference was that an NDS was an internal data store (not seen by users/applications), and an ODS was a hybrid data store, used as an internal store before data is moved to a dimensional data store, but also one that could be seen by users and applications. I have now read that an ODS does not contain any historical data, and contains only current transaction ...

SCD using MERGE statement

Straightdrive — Thu, 26 Nov 2015 10:55:52 GMT

Thanks to Warren's Design Tip #107, it is very clear how to implement SCD Type 1 and Type 2 using the MERGE statement. However, if a dimension has both Type 1 and Type 2 attributes, we would need to run the MERGE for Type 1 separately, followed by the MERGE for Type 2. Is my understanding correct? Is there any way these two can be combined?

Thanks.

Stuck with Stored Procedures

grinnell — Thu, 19 Nov 2015 13:59:09 GMT

Hello, I have two data sources which will ONLY allow me to access their data via stored procedures. What's worse, these stored procedures return results based on a single ID passed in rather than as a set. In addition, they do not expose to me as a consumer any dates or times showing when data was last updated. Therefore, I pretty much have to pass in all my IDs to return every single record they have in order to cross-reference that with my warehouse data to see if I need to update ...

Loading DataMart Current and History

lourthad2 — Thu, 05 Nov 2015 01:22:33 GMT

Hello Group, I am trying to load historic data into a data mart that is currently being loaded in prod. We want to load the history data into a pre-prod environment and merge it into the prod environment. We have done something similar before for 2 years' worth of history by defining a surrogate key range and then merging with prod. We are trying this approach for a different requirement which has 20 years of history, and are trying to check if there is a better approach to do it than using surrogate ...

Extraction layer on SOA/ESB architecture

Tenrinho — Tue, 27 Oct 2015 18:12:24 GMT

Hi, I'm going to start developing a Data Warehouse at a company that is starting to move its different systems to an ESB, basically adopting an SOA architecture (simplify and split large services into small services and get them "talking" with each other through the ESB with SOAP messages). I wanted to know, based on your professional experience, what the best practices usually are regarding the extraction layer: 1-Where should I get the data from? from the ESB or connect directly to service ...

Fact Table Load Question...

GradStudent2015 — Tue, 27 Oct 2015 12:30:01 GMT

Hello everyone, I have a question and I apologize if it's a stupid question, but I'm not a DBA/Data Warehouse developer; I come from the BI end user side and I'm very new to this. Say I have a standard star schema with 3 Dimensions (Automobile Orders, Time and Product; Time and Product are hierarchical) and a Fact table. My question is, when I create the Fact table, traditionally I would have a FK that joins it to each of the Dimensions. For simplicity's sake my Auto Orders Dim table has ...

Need solution for parallel load of late arriving dimension

Iamundercover — Mon, 26 Oct 2015 20:23:03 GMT

Hello, Q. We have implemented a data lake (PDM based on IBM Banking DW) on a Netezza appliance and we use DataStage as the ETL tool. I need to design multiple ETL flows for data (events such as the opening of an account, and transactions such as paying at the grocery store, receiving your salary, etc.) coming from the same source system, which would run every 20 mins and load several different tables and a common table (columns: surrogate key, source system id, unique id in source system = account number). The trouble ...

Too Many SSIS packages to manage!

Remark — Mon, 05 Oct 2015 10:24:09 GMT

My data warehousing team currently supports over 700 SSIS packages in production. We have your typical landing zone, staging area and production data warehouse. We also have a user testing environment which replicates the production jobs. Although we have package templates and some reusable SSIS packages, we still find ourselves creating new SSIS packages each week (despite my protest). As you can imagine, this great number of packages is difficult to manage and is ultimately unsustainable. ...

Anti-aliasing time series data in a data warehouse?

rupertlssmith — Fri, 11 Sep 2015 19:03:31 GMT

I'm not sure that "anti-aliasing of time series data" is the correct terminology to use, so let me explain: I have some sources of data that are aligned quarterly, mostly to do with quarterly running costs. I have some other sources of data, some of which are aligned weekly, and some monthly, and these are mostly related to transactions that took place. By the term 'anti-aliasing', I am referring to the problem of how to represent this data on a common time granularity, so that it can all ...

ETL Testing coverage

Henar Safwat — Sun, 06 Sep 2015 16:47:10 GMT

I'm working on a testing project for ETL, and I need to be sure that my test cases cover most of the issues in the project. Is there any way to check this?

ETL Automation test

Henar Safwat — Thu, 03 Sep 2015 14:49:58 GMT

I'm about to conduct a survey about data warehouse tests: which types most need to be automated, and how can they be automated? Any help, please?

Full history staging tables

ryno1234 — Wed, 19 Aug 2015 05:24:00 GMT

In my ETL process, I typically move data from my source system(s), to a staging table of some sort and then from the staging tables into my dimensional model. My routines run nightly and because of that, my highest level of resolution on data is 1 day. I detect daily changes only. The source systems tend to not have very good history tracking / date tracking of modifications, so I've relied on simply bringing source-system data in and comparing it to my staged data to see if there are any ...

Bridge tables

rajeshwarr59 — Wed, 22 Jul 2015 04:03:05 GMT

I am trying to load a bridge table and am having difficulty coming up with the logic to implement this. I did some research online but couldn't understand it. Can anyone provide a sample stub that can help me understand how this bridge table population is usually implemented (using listagg, order by, etc.)? I think a while back I was able to find a query online, but I have lost the link unfortunately.

Help required to reload data in star schema

gokul_ifs — Mon, 13 Jul 2015 12:44:02 GMT

Hi, I am new to data warehouse technologies. Looking for experts' help to understand how to load data for the scenario described below. In my DW system, I will be getting files on a daily basis which contain sales information, and I will be loading these files into a star schema. Let's say I have received files on each of the past 10 days and loaded all of those files into the data warehouse. I am performing an SCD type 3 load with a flag variable. Now, on Day 11, I have to undo all the changes done ...

ETL: process row by row with pipeline between steps or process the whole datasets in every step?

juanvg1972 — Sun, 21 Jun 2015 00:13:09 GMT

I have a doubt about ETL process performance. I have always believed that the best way to process a dataset in an ETL process, in terms of performance, is to process the whole dataset in every step and pass the completed and transformed dataset to the next step of the flow. I have worked like that in SAS (typical SAS data steps), but working in Pentaho DI, I can see that rows flow to the next steps before the whole dataset has been completed. Is that good for performance? In that case ...

handling cases in fact table

NewDpr — Sun, 21 Jun 2015 00:04:53 GMT

Hello All,

I would like to know the best way, and if possible an automatic formula, to handle the case of late arriving facts.
Note that I am not using an ETL tool, only stored procedures in PL/SQL.

Thanks a lot for your help!

Near Real time ETL and ETL cloud

juanvg1972 — Sun, 07 Jun 2015 22:55:16 GMT

Hi, I have read that the principal changes in the data integration process in the near future are real-time ETL and cloud ETL. I have many doubts about these questions, because many people talk or read about them, but few people talk about real experiences or real projects. I write my doubts here asking for help: 1) Real-time ETL: are we talking about CDC systems? Several micro-batch processes in a day? What other changes are needed in real-time ETL? What do I have to consider in my ETL process ...

Testing ETL pipeline(s)

babalu — Fri, 05 Jun 2015 22:34:51 GMT

Having read The Data Warehouse Toolkit, I'm looking for some resources for additional information regarding testing of ETL pipeline(s) for performance, data fidelity and data model consistency.

ETL Fragility

ryno1234 — Thu, 28 May 2015 20:25:16 GMT

My ETL process is built upon jobs strung together in sequence and run nightly (once every 24 hours). Our nightly routine kicks off and runs the jobs one after the other. Upon a failure, the entire process stops, logs the failure in a DB and will attempt the same thing the following night. The *hope* is that during the day following a failure, the failure is investigated, fixed, and the failed jobs are manually re-run ASAP. This entire process seems very fragile: As there becomes more and ...

Looking for suggestions on ETL tools for DW/BI Project

GregDC — Thu, 07 May 2015 12:46:47 GMT

Greetings again,

I am back with a very basic question: Given a DW with conformed dimensions, currently 4 factless fact tables, and Dimensions with a few SCD 1 and 2 attributes, is there a good set of tools to help in the ETL for the Warehouse on the market today? We are currently looking to build the DW on a Microsoft Server in the cloud somewhere with the Operational Data Stores possibly on local machines.

What is datawarehouse automation?

juanvg1972 — Wed, 06 May 2015 18:09:29 GMT

Hi Ajulius,

Can you explain what exactly data warehouse automation is?

Is the ETL script generated automatically? What about complex business rules?

Thanks,

ETL Tools please

TheNJDevil — Wed, 06 May 2015 03:10:45 GMT

Are there any ETL tools that will provide the most flexibility with the least amount of custom coding (as in little to none) out of the box? I'm just looking for a list of tools that others have used that didn't require a lot of custom programming to get enterprise quality ETL processes (a large variety of sources, audit capability, possibly data lineage, etc.).

Any assistance would be great. Thanks.

ETL design and performance questions

juanvg1972 — Tue, 05 May 2015 21:45:07 GMT

I have some questions about ETL design and process optimization: 1) CDC systems: I know of CDC systems used in ETL processes that read the database log (for example the Oracle redo log) in order to get the changes and create an input to the ETL process. I suppose there are more CDC systems; which is the best one? What are the differences? 2) Indexes: I have read that when you are loading a database in an ETL process it is better to disable indexes and enable them after the loading process. Is that right? ...

ETL Fact Load in SSIS

piyushtamaskar21 — Tue, 28 Apr 2015 06:29:19 GMT

We are well aware of the fact that we always load dimension tables and then fact tables. The tool I am using is SSIS. Ideally we use a transaction table, do a lookup on each dimension, and insert that particular row into the destination, i.e. the fact table. That is how we load the data into the fact table. What if, in our case, we don't have a transaction table and we still want to load the fact table? What possible approaches should I implement? Please reply. Thank you!

HELP WITH SCRIPT COMPONENT, SSIS

sssqllearner — Wed, 25 Feb 2015 17:46:01 GMT

HI, I HAVE SCRIPT WHICH GIVES ME "AREA 3" INFO IN MY STAGING TABLE. HOWEVER I HAVE REQUIREMENT NOW WHERE I NEED TO INSERT AREA 1 AND AREA 2 FROM BELOW TABLE AS WELL IN THE STAGING TABLE, CAN ANYONE PLEASE HELP ME TO UPDATE THE SCRIPT SO THAT I CAN INSERT AREA 1 AND AREA 2...... I am not sure how to use script component yet and hence I would really appreciate if someone can guide me. CURRENT SCRIPT IS: ' Process Row 3 - Skip row that has first column with a single ...

Poorly structured data at source system

JSchroeder — Thu, 12 Feb 2015 06:19:16 GMT

If the data in the source system is poorly structured and not normalized in any way, what is the most typical way to handle this? Is it common to try and normalize the data as it moves through the staging area before pushing it out into a dimensional data mart? We're pulling data from one of our own websites that is stored in SQL server, but it isn't really structured in any way. Would the right approach be to try and normalize it first?

SQL Visualization -- donate SQL code (tech startup asks for help)

sqldep — Fri, 06 Feb 2015 00:50:22 GMT

Hello, for the past 18 months we have been coding daily (yup, including weekends) to solve a problem we had as data-warehouse developers. We were annoyed by manually tracing data lineage inside SQL queries. What used to take us 30 minutes to an hour can now be done in milliseconds. Now, we need your help. We are seeking SQL queries from 25 DWH teams to run our data-lineage visualization engine on real-life ETL processes. Your code base will help us finalize the service prior to releasing ...

Date Dimension future flag Update

sssqllearner — Wed, 04 Feb 2015 10:39:49 GMT

Hi, we currently have a monthly report that shows sales for the last completed month. The SQL Server Agent job is scheduled to run on the first working day of the month. When it ran yesterday it failed because the month ended on a weekend. In our date dimension table we have a column futureperiod, which we set to either 0 or 1: 0 = historic months and 1 = future months. We have an sp [etl].[SetCurrentDate] where we have: UPDATE etl.DimDate SET FuturePeriod = 0 WHERE DateID @CurrentDateID; IS ...

High % of NULLs - yet they want it!

hortoristic — Mon, 02 Feb 2015 19:32:58 GMT

I have a customer that when I do the profiling of the data source, some of the columns are between 90% to 99% NULL and they insist there could be value in bringing them into DW. Specifically to look at the 1% to 10% outliers. This sort of goes against my belief of what a DW should be for. A peer of mine working at another institution pointed out that if they think they need it and are really pressing for it, they are likely to get it one way or another and I should just give them what they ...

data puzzle - looking for a query to solve this

topcat — Mon, 12 Jan 2015 16:33:58 GMT

I have rows in a table that I need to convert to a type 2 dimension, and I'm looking for a query to do this (in bulk, no cursors or looping). The simple thought is to group on value and take the min/max of date, but the values can and will repeat. The current data is rows with a date and a value; assume each day has a value, i.e.

date, value
01/01/15, 10
01/02/15, 10
01/03/15, 10
01/04/15, 10
01/05/15, 12
01/06/15, 12
01/07/15, 12
01/08/15, 09
01/09/15, 10
01/10/15, 10

want to convert this to value ...
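
The pattern described here is usually called "gaps and islands": each unbroken run of the same value becomes one row with a start and end date, so a repeated value (10 above) produces more than one run. The set-based logic can be sketched in Python as below. The sketch assumes the dates are month/day/year and that the desired output is one (value, start date, end date) row per run, since the expected output is cut off in the post; the function name is hypothetical.

```python
from datetime import date
from itertools import groupby


def collapse_runs(rows):
    """Collapse date-ordered (date, value) rows into (value, start, end) runs.

    itertools.groupby only groups consecutive equal keys, so a value that
    reappears later starts a new run instead of merging with the earlier one.
    """
    runs = []
    for value, group in groupby(rows, key=lambda row: row[1]):
        dates = [d for d, _ in group]
        runs.append((value, dates[0], dates[-1]))
    return runs


# The sample data from the post, read as month/day/year (an assumption).
sample = [(1, 10), (2, 10), (3, 10), (4, 10), (5, 12),
          (6, 12), (7, 12), (8, 9), (9, 10), (10, 10)]
rows = [(date(2015, 1, day), value) for day, value in sample]

for value, start, end in collapse_runs(rows):
    print(value, start, end)
```

In SQL the same result is commonly obtained without cursors by comparing two ROW_NUMBER() sequences (one over all rows, one partitioned by value) and grouping on their difference, though the exact query depends on the database.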

ETL Technical Metadata Tools ?

Henar Safwat — Mon, 12 Jan 2015 01:19:33 GMT

I'm asking about any tool that can read the metadata of designed packages and store it in a database. I need to capture the data lineage and the mapping between columns.

Late Arriving Dimension Data

GregDC — Sat, 10 Jan 2015 20:06:51 GMT

I have a Warehouse with a monthly grain. Data from the HR department is reported on as of the last working day of the month. However, we are being told that our counts by status are off. It would seem that in the month after the report is issued, HR provides records with status changes that are backdated to the prior month. So when HR looks at the counts for the July Status report they are correct according to the data in the dimension/fact tables at the time the report was run, but ...

10 Data Integration Predictions for 2015

David_Mai — Mon, 05 Jan 2015 19:22:04 GMT

Hi, I'm a writer for Solutions Review and I've compiled a list of Data Integration industry predictions for 2015 in the links below. Please read and leave your thoughts. Thanks. http://solutions-review.com/data-integration/10-data-integration-predictions-for-2015/ http://solutions-review.com/data-integration/10-data-integration-predictions-for-2015-part-2/

Load Complex Excel Sheet

sssqllearner — Wed, 17 Dec 2014 16:42:01 GMT

Hi,

I am trying to understand how to load the Excel workbook below, but I am unable to work out the steps. Can anyone share their expertise to help me load this file into SQL Server?

This will be a monthly file for production. It will be loaded into SQL Server on a monthly basis.

Thanks,

Kind Regards,


Null and Blank Dates from Source System

Informer30 — Tue, 25 Nov 2014 07:55:50 GMT

Hi All,

I wanted to know what the recommendation is for dates in the DWH where they are feeding through
from the source system as nulls or blanks.

Currently we do not have a consistent solution, as some of the date columns are used in a composite key,
and where they feed through as a null or blank they are converted to 01/01/1900.

Thanks

DW/BI QA and Testing

apc — Mon, 24 Nov 2014 09:11:30 GMT

Hi all, Can anyone give some tips on QA and Testing on DW/BI? I already read the Kimball ETL book that talks a little about the subject but I would like to have a more QA specific view on the subject. What would be your main concerns when testing a DW/BI application? I think testing continues to be not so popular in the DW/BI realm. What types of tests or QA best practices from common Software Engineering Development make sense to apply on DW/BI? Any resource will be greatly appreciated. Thanks!

Integration testing

pallavi — Thu, 20 Nov 2014 14:18:28 GMT

Hi all, I am new to ETL testing and have some doubts which I want to post here. Kindly excuse any errors made by me. 1. How do we test data which has cardinal mappings between fields? Say, for instance, the fields present in a source file are mapped to different fields in different source files. When data is dropped through ETL into an ODS or a datamart, how do we test the data and mapping? For example, we have files with Customer information, which have more than one account, loan account. The account ...

ETL Design Problems for Real time

sky87 — Fri, 19 Sep 2014 10:49:04 GMT

Hi all, I am racking my brain designing an ETL for a real-time DWH. It is not clear to me what I have to do in which steps, and I hope that you can help me a bit. Let me describe my initial situation. I have two source systems, which are independent of each other and are used by different user applications. I have to design a DWH for reporting purposes. Some data in the source systems is the same. I will give you an example. Table 1: Productnumber| Description| ...

Budget at month level

sssqllearner — Tue, 16 Sep 2014 19:21:34 GMT

Hi, This is my first post, so apologies if I am approaching this in an incorrect way. I need some help with the development below: Currently we receive targets at cycle level (a cycle is 4 months, i.e. 01/01/2014 to 30/04/2014). However, there is a requirement from the client to show these targets at month level. Month level data will be based on actual working days in a month, e.g. if a month has 26% of the working days in a business cycle (4 month period) the target (Budget) for this month should be ...
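
The allocation described above is a proportional split: each month's share of the cycle target equals its share of the cycle's working days. A minimal Python sketch of that arithmetic follows. It assumes working days are Monday to Friday with no public holidays, and the cycle target amount is a made-up figure; a real implementation would more likely count working days from a flag on the date dimension.

```python
from collections import Counter
from datetime import date, timedelta


def working_days_by_month(start, end):
    """Count Monday-Friday days per (year, month) between start and end inclusive.

    Public holidays are ignored here (an assumption for this sketch).
    """
    counts = Counter()
    day = start
    while day <= end:
        if day.weekday() < 5:  # 0-4 are Monday to Friday
            counts[(day.year, day.month)] += 1
        day += timedelta(days=1)
    return counts


def allocate_target(cycle_target, start, end):
    """Split a cycle-level target across months by working-day share."""
    days = working_days_by_month(start, end)
    total_days = sum(days.values())
    return {month: cycle_target * n / total_days for month, n in days.items()}


# Cycle dates from the post; the target amount is hypothetical.
cycle_start, cycle_end = date(2014, 1, 1), date(2014, 4, 30)
monthly_targets = allocate_target(86_000, cycle_start, cycle_end)

for (year, month), target in sorted(monthly_targets.items()):
    print(f"{year}-{month:02d}: {target:.2f}")
```

Because the monthly figures are fractions of the same total, they sum back to the cycle target (up to rounding), which is a useful check when loading the month-level rows.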