Generating a Junk dimension - Cross join or based on actual values in fact?

kletos Thu Feb 21, 2013 5:35 pm

Hi there,

Are there any problems if I create a junk dimension and maintain it like a Type 1 SCD, compared to creating the junk dimension as a cross join to build an exhaustive list of permutations.

The rationale for a Type 1 styled Junk dimension is that a cross join results in 10,000 rows, while a distinct of attributes in the fact table results in only 100 rows. This 100 rows may never reach 10,000 rows.

Thanks!
ML

rathjeevesh Thu Feb 21, 2013 10:12 pm

As per the kimball group advice we should not be creating the exhaustive list with the cross join, to begin with. In your scenario it is still manageable if the junk dimension will go up to 10K rows if created through cross join, but there can be cases where the distinct values for each of the columns may be little high and there can be a number of columns to be placed in the junk dimension which may lead to a huge dimension in terms of rows.

Its better to build your junk dimension based upon data realities i.e. populate them with new rows as they come from source. But if you are sure the number of rows by cross join will be at max 10K i.e. source has some restriction already enforced, distinct values for the columns are predefined then you may build the exhaustive list and may not have to worry about maintaining the ETL job to populate the junk dimension.

ngalemmo Fri Feb 22, 2013 7:35 am

rathjeevesh wrote:As per the kimball group advice we should not be creating the exhaustive list with the cross join, to begin with. In your scenario it is still manageable if the junk dimension will go up to 10K rows if created through cross join, but there can be cases where the distinct values for each of the columns may be little high and there can be a number of columns to be placed in the junk dimension which may lead to a huge dimension in terms of rows.

Its better to build your junk dimension based upon data realities i.e. populate them with new rows as they come from source. But if you are sure the number of rows by cross join will be at max 10K i.e. source has some restriction already enforced, distinct values for the columns are predefined then you may build the exhaustive list and may not have to worry about maintaining the ETL job to populate the junk dimension.

This is correct. There may be a few times when a cross-join makes sense, but most of the time, just create rows for combinations that actually exist.

chade25 Mon Feb 25, 2013 11:40 pm

I like to test with:

select count(*) from (select distinct [column] from [table]>);

and in the case that it is customer ids, then leave it out of the select to get the distinct values of what would be the junk dimension and add columns in to the inner select and see what ones make it sky rocket to get an idea of the best combo.

I did that today for a demographic junk dimension and it was only 1037 for every combo! I was very happy. But I want to keep even the old values because the source system just stomps over them. So if I were to say just truncate the table and recreate it, i would lose accurate demographic history, because some combos would not link to that person...so add it as it comes to maintain accurate history on the fact table at the time of the event.

Sponsored content

Generating a Junk dimension - Cross join or based on actual values in fact?

Generating a Junk dimension - Cross join or based on actual values in fact?

RE: Generating a Junk dimension - Cross join or based on actual values in fact?

Re: Generating a Junk dimension - Cross join or based on actual values in fact?

Re: Generating a Junk dimension - Cross join or based on actual values in fact?

Re: Generating a Junk dimension - Cross join or based on actual values in fact?