"Noisy" data
"Noisy" data
We have some 'noisy' data: changes may be made to a row multiple times a day, and all of those changes are required to be persisted. At any point in the day, we want to see the most recent row. I have seen this accomplished at other companies by using an effective date and an expiration date, where the expiration date on a superseded row is set to 1 day before the effective date of the row that replaces it. So if multiple changes are made to the same row in a given day, the effective date is greater than the expiration date on all but the latest row. At the end of the day the data would appear as below:
Create date/time      ID   Column Value  Effective Date  Expiration Date
06/24/2014 08:30:00   101  CATT          06/24/2014      06/23/2014
06/24/2014 08:35:00   101  KATT          06/24/2014      06/23/2014
06/24/2014 11:10:07   101  CAT           06/24/2014      12/31/9999
The first 2 rows would never be returned in a query using "date between" syntax, but the history of what 'was' is not lost. Are you familiar with identifying these "noisy" rows in this manner? If so, is there any design tip / white paper from the Kimball Group on this topic? Some of my colleagues are not familiar with this technique, so any assistance you can give would be most appreciated.
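For example, the end-of-day "current row" query would look something like this (table and column names are illustrative, not our actual schema):

-- A sketch of the "date between" filter. The two noisy rows fail the
-- predicate because their expiration date (06/23/2014) falls before
-- their effective date (06/24/2014).
SELECT *
FROM   hr_dim
WHERE  id = 101
  AND  DATE '2014-06-24' BETWEEN effective_date AND expiration_date;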
Thanks!
PLYounger- Posts : 8
Join date : 2012-03-16
Location : Washington DC
Re: "Noisy" data
Is this a dimension or fact table?
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Noisy Data
It is a dimension of HR data. Even though there may be multiple changes taking place throughout the day on one employee, the users want all these changes persisted. However, the reports using this dimension only want the most current end-of-day picture of the HR data. We need to filter out the 'noise' of the erroneous/previous data that happened within the day. There are a couple of ways to do this (for example, including a sequence number alongside the effective date as part of the natural key), but the easiest way I've seen it done is to always set the expiration date to 1 day prior to the most current effective date. This ETL rule would apply to all rows being inserted (no data is updated, only new inserts).
Hope this helps to clarify my question.
Thanks :-)
PLYounger- Posts : 8
Join date : 2012-03-16
Location : Washington DC
Re: "Noisy" data
OK. What was throwing me off was that your ID is not incrementing. If that is your SK (surrogate key) and not your NK (natural key), it needs to change for each row. You have two different problems. The first is what the natural key for this dimension is: you need to add some sort of time attribute so that you can correctly sequence the changes. For effective and end dating, I always set the end date equal to the next row's effective date. The other problem you have is how to get the most current row. This has been discussed extensively in many threads over the years. You can do a self join on the dimension to get the most recent row (a sketch follows), add an additional key column to the dimension that represents the NK and copy it to your fact table, or create an additional type 1 dimension.
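As a rough illustration of the self-join option, using the create date/time from your example as the sequencing attribute (all names hypothetical):

-- Most recent row per natural key, sequenced by the create timestamp.
SELECT d.*
FROM   hr_dim d
JOIN  (SELECT id, MAX(create_dttm) AS max_dttm
       FROM   hr_dim
       GROUP  BY id) latest
  ON   d.id = latest.id
 AND   d.create_dttm = latest.max_dttm;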
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Noisy Data
Thank you for all your feedback :-)
The natural key is the ID that I'm showing (i.e. 101); I'm not showing the surrogate key in my example. The surrogate key is being incremented by 1 for each new row.
The problem I have with setting the expiration date equal to the next row's effective date is that I cannot use date-between logic (where '06/24/2014' between effective date and expiration date); I would have to use <= / > logic (where effective date <= '06/24/2014' and expiration date > '06/24/2014'). It works but seems a bit more complicated.
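Side by side, the two conventions look like this (table and column names illustrative):

-- Closed interval (expiration = next effective date - 1 day): BETWEEN works.
SELECT * FROM hr_dim
WHERE  DATE '2014-06-24' BETWEEN effective_date AND expiration_date;

-- Half-open interval (expiration = next row's effective date): the upper
-- bound must be strict, so BETWEEN cannot be used.
SELECT * FROM hr_dim
WHERE  effective_date  <= DATE '2014-06-24'
  AND  expiration_date >  DATE '2014-06-24';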
By setting the expiration date to the next row's effective date - 1, the 'noisy' data would never satisfy either of the predicates mentioned above and would therefore never be returned in a query; in essence they are filtered out, which is what I want.
I actually know how to handle the situation as I've done it at my previous employer; I just wanted the KG feedback on 'how' I'm proposing to handle the situation as my coworkers at my current employer are not familiar with 'noisy' data.
PLYounger- Posts : 8
Join date : 2012-03-16
Location : Washington DC
Noisy Data
Quick thought:
If the business needs to see the version as of a specific point in time, then you may have to add a timestamp to each Effective/Expire Date pair and use one second prior instead of one day prior.
If they only want to see the current version for a given day, then you might be all set the way you are doing it?
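For instance, with timestamp-grained columns (names illustrative), a point-in-time query might look like:

-- With full timestamps, the 08:30 row would be expired one second before
-- the 08:35 row becomes effective, so an "as of" query at any instant
-- returns exactly one version.
SELECT *
FROM   hr_dim
WHERE  id = 101
  AND  TIMESTAMP '2014-06-24 09:00:00'
       BETWEEN effective_dttm AND expiration_dttm;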
ThomVF- Posts : 3
Join date : 2014-06-26
Noisy Data
Thanks for your feedback, Thom.
The users want to see the most recent row at the time they run a report. So the most recent row at run-time would have an expiration date of '12/31/9999'. That is why I don't have a timestamp on the table.
PLYounger- Posts : 8
Join date : 2012-03-16
Location : Washington DC
Noisy Type-2
If the users don't ever care about seeing the version "as of" a particular time of day (in the case of intra-day feeds), then your original design seems ok.
This is how we are doing it here. In a couple of cases, multiple ETL processes are responsible for updating the same type-2 Dimension. If this happens, the first one would have the expiration set prior to the effective. The second update would be "current", with its effective date = today and its expiration date = 12/31/9999, exactly as in your example.
This works fine for us because we are only interested in end-of-day state.
The key for our case is making sure the dimensions are updated by processes in a naturally sequential manner, so the last update reflects what the business would consider the last change.
The problem with this design is that if the first update is meaningful in any way, it probably won't ever be seen.
ThomVF- Posts : 3
Join date : 2014-06-26
Noisy Data - Type 2
Thanks, Thom.
I believe you have answered my original question.
I appreciate all your feedback and insight; it was very helpful.
Thanks again!
PLYounger- Posts : 8
Join date : 2012-03-16
Location : Washington DC
Re: "Noisy" data
Hi - I'm a bit confused by your statement "the users want all these changes persisted. However, the reports using this dimension only want the most current end-of-day picture of the HR data".
What are the users doing with this data if not reporting on it? If you are not reporting on it, what's it doing in your reporting database?
If you need to keep all the changes made in a day - presumably for auditing purposes or similar? - then I'd store them in your staging area or some other datastore. Bringing them into a dimension but never using them seems an odd thing to do, but maybe I'm missing the point?
Regards
nick_white- Posts : 364
Join date : 2014-01-06
Location : London
Re: "Noisy" data
LOL! Nick, I get these nutty requirements all the time. "Hey Boxes, here's a new feed with 700 new fields. Put it in the warehouse". And that's it. I totally missed the ludicrousness of the OP's requirement.
BoxesAndLines- Posts : 1212
Join date : 2009-02-03
Location : USA
Noisy Data - Type 2
I totally understand where you are coming from :-) LOL... but the business rule here is that data is not updated nor deleted; just new rows created.
There are multiple data feeds of the HR data that happen throughout the day, and it is published to the datamarts multiple times as well. So the requirement is to report on the most recent data at the point of the day when a query is executed. If a subsequent feed later in the same day provides a new row, then the previous row needs to be 'expired' and a new row created from the new feed. The data is never deleted, just expired, so I tell the ETL developers to set the expiration date = effective date - 1 day. When multiple updates occur on the same day, the expiration date is prior to the effective date. I don't have a problem with this; it is just 'noise', and granted, noise does clutter the database and preferably should be removed. But the requirement is to keep the data, so...
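A minimal sketch of that expire-and-insert rule, assuming hypothetical table and column names (date arithmetic syntax varies by database):

-- Expire the previous current row: its expiration date becomes one day
-- before the incoming row's effective date.
UPDATE hr_dim
   SET expiration_date = DATE '2014-06-24' - INTERVAL '1' DAY
 WHERE id = 101
   AND expiration_date = DATE '9999-12-31';

-- Insert the new version from the feed as the current row.
INSERT INTO hr_dim (id, column_value, effective_date, expiration_date)
VALUES (101, 'CAT', DATE '2014-06-24', DATE '9999-12-31');

-- When both rows share the same effective date (an intraday change), the
-- expired row ends up with expiration < effective: the 'noise' that falls
-- out of any date-range filter.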
I thought this was a pretty common technique and was just vetting it against this forum. My current coworkers are not familiar with 'noisy data' and wanted me to provide some feedback from others on this topic.
Thanks again.
PLYounger- Posts : 8
Join date : 2012-03-16
Location : Washington DC
Re: "Noisy" data
OK - but I still don't get why you are loading data into the data warehouse that will never be used.
Ideally, just run a single ETL job at the end of day to extract the current values for any record that has changed in that day.
If you have to take a real-time feed of data from your source, then why not hold it in your staging area and then run an end-of-day ETL job that takes the most recent version of each source record (or aggregates all versions of a source record into a master end-of-day record) and loads that into your DW?
If you have to do real-time loads into your data warehouse, then why not just create a new Dimension record for the first change of a source record per day (standard SCD Type 2 process) and then, for any subsequent changes in that day, update the record in place (effectively an SCD Type 1 process, I guess)? A sketch follows these options.
If you have to create real-time records in your DW, you only want the end-of-day position and you can't do updates, then your business has given you a set of incompatible requirements and I'd go back and ask them how you should resolve the situation: asking you to create a fact record at 11am that references a dimension holding values that will only be correct at the subsequent midnight, without doing any updates, is not possible - unless they are expecting you to be clairvoyant.
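A rough sketch of that hybrid Type 2 / Type 1 idea (all names hypothetical; date arithmetic syntax varies by database):

-- First change of the day: standard Type 2 - expire yesterday's current
-- row and insert today's version.
UPDATE hr_dim
   SET expiration_date = CURRENT_DATE - INTERVAL '1' DAY
 WHERE id = 101
   AND expiration_date = DATE '9999-12-31';

INSERT INTO hr_dim (id, column_value, effective_date, expiration_date)
VALUES (101, 'CATT', CURRENT_DATE, DATE '9999-12-31');

-- Any later change the same day: Type 1 - overwrite today's row in place,
-- so only the end-of-day value survives and no noisy rows are created.
UPDATE hr_dim
   SET column_value = 'CAT'
 WHERE id = 101
   AND effective_date = CURRENT_DATE;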
nick_white- Posts : 364
Join date : 2014-01-06
Location : London
Noisy Data
My apologies for not being more specific...
The HR dimensions are built throughout the day, and users want to run ad-hoc queries / SAS programs against the dimensions throughout the day. However, the facts are not built until the end of the day, so a valid HR row is associated with each fact.
My basic question is not whether to load these rows, or whether to delete / update existing rows - those decisions have already been made by someone with a higher pay grade than mine :-) - my question was about the technique (setting the expiration date) I'm using to identify the expired rows.
So if I receive a newly hired associate at 8:00 am, effective as of 06/27/2014 (no timestamp), it will have an expiration date of 12/31/9999. If I receive an update for that same row for the same effective date, I will need to expire the first row. The question is what date should be used for the expiration date? My suggestion was to treat it like we do any expiration logic: set it 1 day less than the effective date of the new row.
I can see the confusion in thinking the facts were being associated with a dimension row which would never be returned... sorry for omitting that piece of information :-(
PLYounger- Posts : 8
Join date : 2012-03-16
Location : Washington DC
Re: "Noisy" data
OK - so going back to your original question, I would load these dimensional records with nulls in the effective/expiration dates if you can; if they are defined as NOT NULL in your DB, give both fields on every record some date that would never be used in the real world - it doesn't matter what, but something like 01/01/1850 would work. Then, in your daily ETL load process, when you are loading the facts, go through these dummy dimensional records and create the real end-of-day dimensional records that you want to reference in your fact tables.
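Something along these lines, with hypothetical names and the 01/01/1850 sentinel:

-- Intraday loads land with a sentinel date that can never occur in
-- real data.
INSERT INTO hr_dim (id, column_value, effective_date, expiration_date)
VALUES (101, 'KATT', DATE '1850-01-01', DATE '1850-01-01');

-- The end-of-day job then promotes the latest sentinel row per key (using
-- the create timestamp to sequence them) into a real current version.
INSERT INTO hr_dim (id, column_value, effective_date, expiration_date)
SELECT d.id, d.column_value, CURRENT_DATE, DATE '9999-12-31'
FROM   hr_dim d
JOIN  (SELECT id, MAX(create_dttm) AS max_dttm
       FROM   hr_dim
       WHERE  effective_date = DATE '1850-01-01'
       GROUP  BY id) latest
  ON   d.id = latest.id
 AND   d.create_dttm = latest.max_dttm
WHERE  d.effective_date = DATE '1850-01-01';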
However...
If it were me and I had a free choice, I wouldn't do this. A dimensional model should only be queried through one or more fact tables, but from what you have said, someone in your company has at some point made the decision to use the dimensional model as a dumping ground for data that doesn't belong there - so they have it in a table and can analyse it, presumably? Loading a record into a dimension that subsequently doesn't get referenced in any fact tables is fine - but deliberately loading data into a table that you know in advance will never get used, and then trying to bend your dimensional model design to allow for this, seems to me like a really bad idea.
If your business wants to be able to analyse each change in a source record then either dump this data into an operational data store (ODS) and let them get on with it, or build it into your dimensional model to allow them to run the analysis they want, e.g. create a fact table with the grain of "change of data". This would probably be factless; you'd associate it with HR dimensions that have date/time columns for their start/end effective dates, and you could introduce a Time Dim as well as your Date Dim, plus whatever else could be used in the analysis. Then, whenever there is a change in the source system(s), you create a record in your new fact table(s). You can then load just the data you need into your existing Dimension(s).
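As a rough illustration of that "change of data" grain (all names hypothetical):

-- Factless fact capturing one row per source-record change.
CREATE TABLE hr_change_fact (
    employee_key    INTEGER NOT NULL,  -- FK to the HR dimension version row
    change_date_key INTEGER NOT NULL,  -- FK to the Date dimension
    change_time_key INTEGER NOT NULL   -- FK to the Time-of-day dimension
);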
Obviously there may be reasons for your company doing what it's doing that I'm not aware of - reasons that make sense for your company's specific circumstances - and there is probably not the will/desire to change things that, from a practical perspective, already deliver what's required. So feel free to ignore me and my "ideal world" perspective, but I would recommend that, if you can, you do your best to prevent any further deviations from the standard dimensional model approach.
nick_white- Posts : 364
Join date : 2014-01-06
Location : London