New to BI and need a little help understanding the basic concept

Spiralmatrix Tue Apr 27, 2010 11:55 am

Hi,

I apologize in advance if you lose the will to live with my question or if you think it is a stupid one but I am very know to BI with limited SQL knowledge, I have be tasked with developing a BI solution which on the face of it seems small scale just to analyse sales data.

I have several data tables for order header, order details, stock, customer and understand I need to create a fact table. On reading the 'Data warehouse ETL Toolkit book' and getting to page 162 the table there says create Surrogate keys for primary keys on each dimension and link them as foreign keys in the fact table for each dimension along with the facts. I can get my head round this my questions are as follows:-

the integer for each PK in each dimension when importing the table data for each dimension should the integer start at 1 for each row of each dimension or is the idea import and have a unique integer across all dimensions?

and do I replicate in the fact table under each FK for each dimension the integer so that the FK can reference the PK correct?

Like I said this might be a very stupid question but I need to get the concept clear in my head so it all makes sense.

Thanks in advance for anyone’s help

ngalemmo Tue Apr 27, 2010 1:51 pm

There are no stupid questions, only stupid answers...

the integer for each PK in each dimension when importing the table data for each dimension should the integer start at 1 for each row of each dimension or is the idea import and have a unique integer across all dimensions?

The thing with surrogate keys is, it doesn't matter how you get one, as long as it is unique for the table. So either way will work. Some databases, such as SQL Server have an 'autonumber' data type that will generate the value for you.

and do I replicate in the fact table under each FK for each dimension the integer so that the FK can reference the PK correct?

The basic load process is to first load dimensions (add new rows, update old ones) then load facts. The fact data will contain natural keys (such as customer number) that you use to locate the correct row in the corresponding dimension to get the surrogate primary key. A fact table consists of a bunch of foriegn keys (i.e. the surrogate primary keys from the dimension table) and measures.

Spiralmatrix Tue Apr 27, 2010 2:21 pm

Thank you very much for helping so quick, I have basically done what you have suggested.

I have been given the data in character delimited txt files for the Order Header, Order Details, Stock, Customer and sales Reps.

I have imported this data into SQL2008 as separate tables for each one, then created a second database with tables matching the first, then in the second one I have create an auto increasing integer in each table and made it my PK, then I have in the Order header table and order detail tables created a column in each based on a combination of columns to make a unique entry in each table. for example while in the Order header the order number is unique I have still created a PK based on auto increasing integer then imported the table data from the first database to this new database. On the Order details I have taken the same approach for the PK for NK I have had to create a column combined with OrderNo+OrderLineno+warehouse and product so as to make a unique column.

my grey area and confusion lies in building the fact table, I understand I need a FK for each dimension in the fact table as well as the facts, but currently each dimension table PK starts at 1 and counts up, my theory, right or wrong, most likely wrong.

Import the Order details integers from its PK to the fact tables first dimension FK as the first dimension, then re-run the process for the order details dimension as the second dimension and populate the second dimension FK in the fact table with the correct integer in relation to Order Details dimension based on if the first Dimension Key of the Order header matches the order number data in the Order Details dimension.

Then run the process again for the stock dimension and after that for the Customer dimension and then Sales Rep dimension.

That kind of made some sense at first, then i thought what if I have the idea all wrong, as if I build a date table and then import that into the fact table with the same PK integer then how on earth do I relate the date dimension PK data to the FK dimension in the Fact Table.

That is where I think thought perhaps every transaction in every dimension table should be unique, so say PK int of Order header runs 1 - 4000 , then Order Details 4001 - 35000 , Stock 35001 - 35800, customers 35801 - 36200, sales reps 36201 - 36280 and the date 36281 - 40280.

Hopefully the above explains where I am, and I am wondering if I have lost the plot, missed the concept or just getting the wrong end of the stick.

The more I read the books I have the more i go round in circles with what I think is right, then think it must be wrong. I know once I have the correct concept in my head the rest will fall into place but just need to understand in my head the mechanics of how this part works.

ngalemmo Tue Apr 27, 2010 4:10 pm

I have imported this data into SQL2008 as separate tables for each one, then created a second database with tables matching the first, then in the second one I have create an auto increasing integer in each table and made it my PK, then I have in the Order header table and order detail tables created a column in each based on a combination of columns to make a unique entry in each table. for example while in the Order header the order number is unique I have still created a PK based on auto increasing integer then imported the table data from the first database to this new database. On the Order details I have taken the same approach for the PK for NK I have had to create a column combined with OrderNo+OrderLineno+warehouse and product so as to make a unique column.

I think you need to back up and rethink your dimensional model. You need to separate facts from dimensions and identify what the dimensions are. Facts, for the most part are probably going to come from the order detail table (qty sold, extended price, etc...). Your dimensions are going be date (probably more than one), customer, product, and a host of other attribues. These dimensions may or may not directly relate to a source table. For example, it is not common to have an 'order' dimension in a sales star schema. Often the order number is simply a column in the fact table (degenerate dimension) and order header level attributes are distributed into other dimensions (sales channel, order type and so on).

Your fact table may look something like this:
order_date_key (fk)
bill_to_customer_key (fk)
sold_to_customer_key (fk)
sales_channel_key (fk)
product_key (fk)
order_quantity
extended_price
etc...

You load the row in one pass with multiple steps (using SSIS on SQL Server), doing lookups to find surrogate keys, then writing the row out once.

Surrogate key values themselves have no meaning and should not be set up using some 'system'. All they do is number a row. The FK column holds that number and uses it to join to that row in the dimension table. The reason you use surrogate keys is create a fixed relationship between a fact row and a dimension row that is not broken should the business keys change. In other words, if a sale points to a specific row in the customer dimension, it will always point to that row in the customer dimension even if the customer number was changed because of a new order system.

Sponsored content

New to BI and need a little help understanding the basic concept

New to BI and need a little help understanding the basic concept

Re: New to BI and need a little help understanding the basic concept

This is what I have so far and where I am in my own mind

Re: New to BI and need a little help understanding the basic concept

Re: New to BI and need a little help understanding the basic concept