Loading and cleaning "randomly" updated, timestamped data from a 3rd party
Hi,
I have a data source, a third-party API that essentially gives us a ton of Atom feeds. I'm not sure whether this is an ETL problem or a design problem, but I'm leaning towards solving it during loading/cleaning. Anyhow...
We get data on videos from a variety of third-party sites detailing how many times they've been viewed (among other things, but let's focus on views). It's a fairly simple DW design: there is a date dimension and a time-of-day dimension (time-of-day analysis is *extremely* valuable to the customers, even though it's not always possible). The problem is that the feeds are updated somewhat randomly, and they only report the running total of views. A feed might be updated every few seconds, or it might go five days without updating. We store the last-recorded values along with a timestamp, so the delta between two updates is easy to compute. My problem is: if there is a lapse of a few days before a feed updates, what do I do about time-of-day when loading facts? Currently, if the range crosses day boundaries, I split it at each boundary and insert a fact per sub-interval, along the lines of the sketch below.
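Roughly, the split looks like this (a simplified Python sketch, not our real ETL code; I'm assuming the delta is spread evenly over the interval, and the function name is made up):

```
from datetime import datetime, timedelta

def split_by_day(prev_ts, new_ts, delta_views):
    # Apportion the view delta to each calendar day by the fraction of
    # the interval falling on that day (assumes an even spread of views)
    total_seconds = (new_ts - prev_ts).total_seconds()
    facts = []
    start = prev_ts
    while start < new_ts:
        # midnight at the start of the next calendar day
        next_midnight = datetime.combine(start.date() + timedelta(days=1),
                                         datetime.min.time())
        end = min(next_midnight, new_ts)
        fraction = (end - start).total_seconds() / total_seconds
        facts.append((start.date(), round(delta_views * fraction)))
        start = end
    return facts

# 9,000 new views reported after a 36-hour gap spanning three calendar days:
# splits into 1500 on day one, 6000 on day two, 1500 on day three
print(split_by_day(datetime(2010, 5, 13, 18, 0),
                   datetime(2010, 5, 15, 6, 0),
                   9000))
```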
Since we've added time-of-day, this obviously complicates things further. It feels like I need some way to represent time intervals, but I'm not 100% sure how to do that, and it seems like drilling down and viewing by day (which is standard in this industry) would become complicated.
I guess my options are:
1. Change my design?
2. Clean this by cutting day-spanning intervals up by day and pointing them at an "Unknown" or "All" record in the time-of-day dimension (what I'm doing now, plus the "Unknown" time-of-day row, which will be necessary anyway; see the sketch after this list). This has the unfortunate effect of obliterating time-of-day analysis any time we get a delta spanning more than our time-of-day grain.
3. ??? Something clever I haven't thought of?
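To make option 2 concrete, here's a rough sketch of the cleaning rule (Python again; the hourly grain and the hour-number surrogate keys are placeholder assumptions, not our actual dimension):

```
from datetime import datetime, timedelta

UNKNOWN_TOD_KEY = -1   # surrogate key for the "Unknown"/"All" time-of-day row

def time_of_day_key(start, end):
    # Only point at a real time-of-day row when the whole sub-interval
    # fits inside a single hourly bucket; otherwise fall back to Unknown.
    bucket_start = start.replace(minute=0, second=0, microsecond=0)
    bucket_end = bucket_start + timedelta(hours=1)
    if end <= bucket_end:
        return start.hour       # hypothetical key scheme: key = hour number
    return UNKNOWN_TOD_KEY      # delta spans the grain: time-of-day is lost

print(time_of_day_key(datetime(2010, 5, 15, 9, 5),
                      datetime(2010, 5, 15, 9, 40)))   # -> 9
print(time_of_day_key(datetime(2010, 5, 15, 9, 5),
                      datetime(2010, 5, 15, 14, 0)))   # -> -1 (Unknown)
```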
I think, at the root of it all, if I had better/more consistent source data, the design would look remarkably like the simple retail sales model (with "views" instead of items sold). But I feel like I should keep my facts sparse. Since I can sometimes only get a delta every few days that represents a variable interval of time, I either have to fight this by loading tons of redundant facts (averaging that delta out over my time grain) or solve it with some heavy lifting in the query tools (which feels wrong as well).
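For comparison, here's what "averaging that delta out over my time grain" would mean at load time, again just a sketch assuming an hourly grain:

```
from datetime import datetime, timedelta

def smear_over_hours(prev_ts, new_ts, delta_views):
    # The "redundant facts" alternative: one fact per hourly bucket the
    # interval touches, weighted by how much of the interval overlaps it.
    total = (new_ts - prev_ts).total_seconds()
    facts = []
    bucket = prev_ts.replace(minute=0, second=0, microsecond=0)
    while bucket < new_ts:
        overlap_start = max(bucket, prev_ts)
        overlap_end = min(bucket + timedelta(hours=1), new_ts)
        weight = (overlap_end - overlap_start).total_seconds() / total
        facts.append((bucket.date(), bucket.hour, delta_views * weight))
        bucket += timedelta(hours=1)
    return facts

# 120 views over 18:30-20:30 -> 30 views at 18:00, 60 at 19:00, 30 at 20:00
for row in smear_over_hours(datetime(2010, 5, 14, 18, 30),
                            datetime(2010, 5, 14, 20, 30), 120):
    print(row)
```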
Any tips?
ianschenck
Posts: 1
Join date: 2010-05-15
Re: Loading and cleaning "randomly" updated, timestamped data from a 3rd party
You've pretty much hit on all your options. The appropriate one is whichever the business wants, so you really need to discuss it with them. Put together a report, lay out the options, make a recommendation, and let them choose.
Similar topics
» Loading data in Data Warehouse
» Suggestions for cleaning data
» Data Cleaning - replacing nulls and addresses
» Incremental Loading of data
» Loading Duration Data