SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
We are converting to SQL Server 2008 and plan to use the data compression feature. As I understand it, row compression (which is also performed as part of page compression) removes trailing spaces from fixed-length fields, so I am under the impression that compression effectively treats a CHAR(20) like a VARCHAR(20). That is not an issue in fact tables, but could it be an issue for dimension tables that keep Type 2 SCD history in fixed-length fields?
I was taught to use fixed-length fields with Type 2 SCD because they are a more efficient data type from a read perspective (at the cost of less efficient storage). If, for example, a Type 2 field had the value "Jef" in a row, which was updated to "Jeff", the database had to put a pointer where "Jef" was originally stored pointing to the location of the new value "Jeff". And if that got updated to "Jeffrey", then a second pointer was placed where "Jeff" was stored, pointing to the location where "Jeffrey" is stored.
Does anyone know if updating a Type 2 SCD in a compressed table works the same way as updates to a VARCHAR field?
In other words, is it best practice to compress dimension tables that have Type 2 SCD columns? Of course, this assumes that the best practice with Type 2 SCD is to use fixed-length fields.
By the way, I am seeing a 50% reduction in the amount of disk space used by fact tables that have been compressed, and that includes the space used by the indexes. Very nice feature.
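For reference, the statements I ran are roughly along these lines (dbo.FactSales is just an example name, not our real table): estimate the saving first, then rebuild the table and its indexes with PAGE compression.

    -- Preview how much space PAGE compression would save (example table name)
    EXEC sp_estimate_data_compression_savings
        @schema_name      = 'dbo',
        @object_name      = 'FactSales',
        @index_id         = NULL,
        @partition_number = NULL,
        @data_compression = 'PAGE';

    -- Rebuild the table (heap or clustered index) with PAGE compression
    ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);

    -- Rebuild the nonclustered indexes with PAGE compression as well
    ALTER INDEX ALL ON dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);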
Jeff Smith- Posts : 471
Join date : 2009-02-03
Re: SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
Hi Jeff,
I can't answer your question about compression, but I can respond to your assumption about VARCHAR vs CHAR.
I'm quite experienced with SQL Server and I have never heard of using CHAR instead of VARCHAR for dimension values. Even if what you are saying were true, there are so many reasons not to use a CHAR field that any query performance increase would be offset by the disadvantages.
I don't understand this statement: "If, for example, a Type 2 field had the value "Jef" in a row, which was updated to "Jeff", the database had to put a pointer where "Jef" was originally stored pointing to the location of the new value of "Jeff". And if that got updated to "Jeffrey", then a second pointer was placed where "Jeff" was stored and pointed to the location where "Jeffrey" is stored."
What do you mean by a pointer? The first record is end-dated, and a second record is inserted. Are you referring to the indexes?
I just ran a test comparing two dimension tables with SCD2, one with varchar and the other char. The varchar ran slightly faster - 62ms compared to 78ms.
I'm happy to post the code of my test if you're interested.
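Roughly, the comparison looked like this (table names, column sizes and the load step are illustrative, not my exact script):

    -- Two copies of the same SCD2-style dimension, one CHAR and one VARCHAR
    CREATE TABLE dbo.DimCustomer_char    (CustomerKey  INT IDENTITY PRIMARY KEY,
                                          CustomerName CHAR(50),
                                          RowStartDate DATETIME, RowEndDate DATETIME);
    CREATE TABLE dbo.DimCustomer_varchar (CustomerKey  INT IDENTITY PRIMARY KEY,
                                          CustomerName VARCHAR(50),
                                          RowStartDate DATETIME, RowEndDate DATETIME);

    -- Load the same few hundred thousand rows into each (loader omitted here),
    -- then run the same query against both tables and compare the timings
    SET STATISTICS TIME ON;
    SELECT COUNT(*) FROM dbo.DimCustomer_char    WHERE CustomerName LIKE 'J%';
    SELECT COUNT(*) FROM dbo.DimCustomer_varchar WHERE CustomerName LIKE 'J%';
    SET STATISTICS TIME OFF;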
John
Re: SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
I said Type 2 but meant Type 1.
My understanding of VARCHAR fields is that if a value is JEF, the database allocates 3 bytes for the value. If at a later time the value JEF is changed to JEFF, JEFF won't fit in the space allocated for the original value. So instead of placing JEFF where JEF was located, the database puts a pointer where JEF was located that points to a new spot where JEFF is located.
If a second change is made, and JEFF gets changed to JEFFREY, a pointer is placed where JEFF is located pointing to the spot on the disk where JEFFREY is now located. If searching for JEFFREY, the disk head goes to the original spot of JEF, sees a pointer where JEFF is supposed to be, and goes to that spot only to find another pointer to where JEFFREY is located.
I think running tests on a table with CHARs, and then on the same table with VARCHARs, may not be valid because it doesn't account for changes to the values (like JEF).
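If it helps, I believe the situation I'm describing shows up in SQL Server as "forwarded records" in heaps, and you can check for them with something like this (the table name is just an example):

    -- Count forwarded records in a heap after widening updates (example table name)
    SELECT index_type_desc,
           forwarded_record_count,
           avg_page_space_used_in_percent
    FROM sys.dm_db_index_physical_stats(
             DB_ID(), OBJECT_ID('dbo.DimCustomer'), NULL, NULL, 'DETAILED')
    WHERE index_type_desc = 'HEAP';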
Jeff Smith- Posts : 471
Join date : 2009-02-03
Re: SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
I think you are trying to split hairs here...
Most databases, when they do an "update in place", actually insert new rows. The reason for this is to provide transactional consistency. If a long query is started and, during that query, an update occurs before the query gets to that row, transactional consistency requires that the query see the row as it existed before the update. So databases typically retain the old row (flagging it as obsolete) and insert a new row with the updated data. Data types don't matter.
In the grand scheme of things, memory and processing power are really cheap and fast, and disk is really slow. Any time you can reduce the volume of data you need to read off physical disk, you will be better off. The rest of it doesn't really matter.
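In SQL Server, for what it's worth, that versioned-read behaviour is opt-in rather than the default locking behaviour; if you want readers to see the pre-update row image it gets switched on at the database level, roughly like this (database name is an example):

    -- Enable row-versioning reads so readers see the row as it was before the update
    ALTER DATABASE MyDW SET ALLOW_SNAPSHOT_ISOLATION ON;
    ALTER DATABASE MyDW SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;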
Re: SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
Then how does Oracle's VARCHAR2 field differ from VARCHAR?
I agree that pulling data off the disk as quickly as possible is the key. It was just my understanding that updates to VARCHAR fields could require the head to read multiple places on the disk. If that is true, then a dimension table with a lot of Type 1 fields that get updated relatively frequently would actually be faster with CHAR fields than with VARCHAR fields. It was my understanding that this is the reason that searches and joins on CHAR fields were so much more efficient than on VARCHAR fields.
I am far from an expert on data types. I pick up things along the way, and no doubt some of it was incorrect. But usually, if something new makes logical sense to me, then I am more than willing to adopt it.
Jeff Smith- Posts : 471
Join date : 2009-02-03
Re: SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
Not sure what the question "Then how does Oracle's VARCHAR2 field differ from VARCHAR?" is asking... the "2" doesn't mean a whole lot... it may have meant something in Oracle's early days when, I guess, VARCHAR was a data type, but for the last 20 years or so, as far as I know, it has been VARCHAR2. Whether you use CHAR (fixed length) or VARCHAR2 (variable length) in Oracle, it doesn't change how it handles updates.
Besides, updates are to rows, not individual columns. The row doesn't get fractured; the entire row stays intact, it is just placed elsewhere in the tablespace. Where a particular row is, in and of itself, doesn't mean anything. What matters is block density... particularly with table scans... which is a measure of the number of valid rows retrieved in a block read. If a table has heavy update activity, most DBs have utilities to compress and reclaim 'dead' space. And if the query uses indexes, density doesn't matter much, since it is going all over the place to get the rows anyway.
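In SQL Server terms, the equivalent of those utilities is a rebuild; a quick way to watch the dead space come back is something like this (table name is just an example):

    -- Space used before reclaiming (reserved vs data shows the 'dead' space)
    EXEC sp_spaceused 'dbo.DimCustomer';

    -- Rebuild the table to repack pages and reclaim space left by heavy updates
    ALTER TABLE dbo.DimCustomer REBUILD;

    -- Space used after the rebuild
    EXEC sp_spaceused 'dbo.DimCustomer';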
Re: SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
My understanding of Oracle's VARCHAR and VARCHAR2 was that VARCHAR2 did not retain the trailing spaces as the CHAR datatype would, but reserved the entire space like the CHAR datatype would. Based on what you are saying, the VARCHAR field would be preferable to the VARCHAR2 field.
And do all databases act the way you describe?
Jeff Smith- Posts : 471
Join date : 2009-02-03
Re: SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
Jeff,
I think you’re referring to page splits. Off the top of my head, it won’t cause any issues if a page split occurs, because the index is automatically updated to point to the correct data page.
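If page splits ever did become a concern, the usual knob is the index fill factor, e.g. something like this (index and table names are illustrative):

    -- Leave 10% free space per page so widening updates split pages less often
    ALTER INDEX PK_DimCustomer ON dbo.DimCustomer
        REBUILD WITH (FILLFACTOR = 90);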
Re: SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
I think I have it - updates are really deletes and inserts, and therefore allocated space doesn't play a role, at least in SQL Server 2K8.
Try to use VARCHAR fields whenever the field is more than 2 bytes. Never use a VARCHAR(1), as it uses more space than a CHAR(1). If you know that the values in a field have a uniform size (such as a state abbreviation or zip code), use the CHAR datatype, unless the field is populated unevenly.
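A quick way to see the overhead that 2-byte rule of thumb refers to (table names and the row count are just for illustration):

    CREATE TABLE dbo.Flag_char    (Flag CHAR(1)    NOT NULL);
    CREATE TABLE dbo.Flag_varchar (Flag VARCHAR(1) NOT NULL);

    -- Insert the same 100,000 single-character values into each table
    INSERT INTO dbo.Flag_char (Flag)
    SELECT TOP (100000) 'Y' FROM sys.all_columns a CROSS JOIN sys.all_columns b;
    INSERT INTO dbo.Flag_varchar (Flag)
    SELECT TOP (100000) 'Y' FROM sys.all_columns a CROSS JOIN sys.all_columns b;

    -- The varchar table comes out larger: each varchar value carries a 2-byte
    -- entry in the row's variable-length column offset array on top of the data
    EXEC sp_spaceused 'dbo.Flag_char';
    EXEC sp_spaceused 'dbo.Flag_varchar';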
Jeff Smith- Posts : 471
Join date : 2009-02-03
Re: SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
I'd like to step back from the detail and offer a high-level perspective relating to the start of this thread:
I've just been working on upgrading a small DW/Data Mart (under 20GB) from SQL 2005 to 2008. I switched on compression after initially stabilising the migrated application and running it for a while, so I have been able to compare like for like.
On the SQL 2008 environment, comparing pre- and post-compression, data volume reduced as expected. Interestingly, ETL (SSIS) elapsed time and cube build time also both reduced by 5-10%. I was expecting these to increase with compression, especially as the server is CPU-bound with only a single-core VM under the hood. I suspect this is the benefit of lower memory use and faster IO for queries within the DW, e.g. looking up surrogate keys, SCD logic, deriving aggregates, etc.
BTW I'm compressing all tables and indexes in PAGE mode.
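To save typing the statements out per table, I generate them from the catalog views, roughly like this (a sketch; you then run the statements it prints):

    -- Generate PAGE-compression rebuild statements for every user table
    SELECT 'ALTER TABLE ' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name)
           + ' REBUILD WITH (DATA_COMPRESSION = PAGE);'
    FROM sys.tables t
    JOIN sys.schemas s ON s.schema_id = t.schema_id;

    -- And the same for all of their indexes
    SELECT 'ALTER INDEX ALL ON ' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name)
           + ' REBUILD WITH (DATA_COMPRESSION = PAGE);'
    FROM sys.tables t
    JOIN sys.schemas s ON s.schema_id = t.schema_id;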
So for now, I'm thinking Compression is all good, and I'm planning to roll it out for my other clients with similar apps.
I'm interested in others' experiences in this area.
Mike
Re: SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
On a project I was associated with that had large volumes of data (a fact table of over 5 billion rows), using compression with SQL Server 2008 gave reductions of about 40% in both disk space and ETL processing times.