SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
We are converting to SQL Server 2008 and plan to use the data compression feature. As I understand it, row compression (which is also performed as part of page compression) removes trailing spaces from fixed-length fields, so I am under the impression that compression effectively treats a CHAR(20) like a VARCHAR(20). That is not an issue in fact tables, but could it be an issue for dimension tables that keep Type 2 SCD history in fixed-length fields?
I was taught to use fixed-length fields with Type 2 SCD because they are a more efficient data type from a read perspective (at the cost of less efficient storage). If, for example, a Type 2 field had the value "Jef" in a row, which was updated to "Jeff", the database had to put a pointer where "Jef" was originally stored pointing to the location of the new value "Jeff". And if that got updated to "Jeffrey", then a second pointer was placed where "Jeff" was stored, pointing to the location where "Jeffrey" is stored.
Does anyone know if updating a Type 2 SCD in a compressed table works the same way as updates to a VARCHAR field?
In other words, is it best practice to compress dimension tables that have Type 2 SCD columns? Of course, this assumes that the best practice with Type 2 SCD is to use fixed-length fields.
By the way, I am seeing a 50% reduction in the amount of disk space used by fact tables that have been compressed, and that includes the space used by the indexes. Very nice feature.
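For reference, the statements I ran are roughly along these lines (dbo.FactSales is just an example name, not our real table): estimate the saving first, then rebuild the table and its indexes with PAGE compression.

    -- Preview how much space PAGE compression would save (example table name)
    EXEC sp_estimate_data_compression_savings
        @schema_name      = 'dbo',
        @object_name      = 'FactSales',
        @index_id         = NULL,
        @partition_number = NULL,
        @data_compression = 'PAGE';

    -- Rebuild the table (heap or clustered index) with PAGE compression
    ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);

    -- Rebuild the nonclustered indexes with PAGE compression as well
    ALTER INDEX ALL ON dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);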
Jeff Smith- Posts : 471
Join date : 2009-02-03
Re: SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
Hi Jeff,
I can't answer your question about compression, but I can respond to your assumption about VARCHAR vs CHAR.
I'm quite experienced with SQL Server and I have never heard of using CHAR instead of VARCHAR for dimension values. Even if what you are saying were true, there are so many reasons not to use a CHAR field that any query performance increase would be offset by the disadvantages.
I don't understand this statement: "If, for example, a Type 2 field had the value "Jef" in a row, which was updated to "Jeff", the database had to put a pointer where "Jef" was originally stored pointing to the location of the new value of "Jeff". And if that got updated to "Jeffrey", then a second pointer was placed where "Jeff" was stored and pointed to the location where "Jeffrey" is stored."
What do you mean by a pointer? The first record is end-dated, and a second record is inserted. Are you referring to the indexes?
I just ran a test comparing two dimension tables with SCD2, one with varchar and the other char. The varchar ran slightly faster - 62ms compared to 78ms.
I'm happy to post the code of my test if you're interested.
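Roughly, the comparison looked like this (table names, column sizes and the load step are illustrative, not my exact script):

    -- Two copies of the same SCD2-style dimension, one CHAR and one VARCHAR
    CREATE TABLE dbo.DimCustomer_char    (CustomerKey  INT IDENTITY PRIMARY KEY,
                                          CustomerName CHAR(50),
                                          RowStartDate DATETIME, RowEndDate DATETIME);
    CREATE TABLE dbo.DimCustomer_varchar (CustomerKey  INT IDENTITY PRIMARY KEY,
                                          CustomerName VARCHAR(50),
                                          RowStartDate DATETIME, RowEndDate DATETIME);

    -- Load the same few hundred thousand rows into each (loader omitted here),
    -- then run the same query against both tables and compare the timings
    SET STATISTICS TIME ON;
    SELECT COUNT(*) FROM dbo.DimCustomer_char    WHERE CustomerName LIKE 'J%';
    SELECT COUNT(*) FROM dbo.DimCustomer_varchar WHERE CustomerName LIKE 'J%';
    SET STATISTICS TIME OFF;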
John
Re: SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
I said Type 2 but meant Type 1.
My understanding of VARCHAR fields is that if a value is JEF, the database allocates 3 bytes for the value. If at a later time the value JEF is changed to JEFF, JEFF won't fit in the space allocated for the original value. So instead of placing JEFF where JEF was located, the database puts a pointer where JEF was located that points to a new spot where JEFF is located.
If a second change is made, and JEFF gets changed to JEFFREY, a pointer is placed where JEFF is located pointing to the spot on the disk where JEFFREY is now located. If searching for JEFFREY, the disk head goes to the original spot of JEF, sees a pointer where JEFF is supposed to be, and goes to that spot only to find another pointer to where JEFFREY is located.
I think running tests on a table with CHARs, and then on the same table with VARCHARs, may not be valid because it doesn't account for changes to the values (like JEF).
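If it helps, I believe the situation I'm describing shows up in SQL Server as "forwarded records" in heaps, and you can check for them with something like this (the table name is just an example):

    -- Count forwarded records in a heap after widening updates (example table name)
    SELECT index_type_desc,
           forwarded_record_count,
           avg_page_space_used_in_percent
    FROM sys.dm_db_index_physical_stats(
             DB_ID(), OBJECT_ID('dbo.DimCustomer'), NULL, NULL, 'DETAILED')
    WHERE index_type_desc = 'HEAP';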
Jeff Smith- Posts : 471
Join date : 2009-02-03
Re: SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
I think you are trying to split hairs here...
Most databases, when they do an "update in place", actually insert new rows. The reason for this is to provide transactional consistency. If a long query is started and, during that query, an update occurs before the query gets to that row, transactional consistency requires that the query see the row as it existed before the update. So databases typically retain the old row (flagging it as obsolete) and insert a new row with the updated data. Data types don't matter.
In the grand scheme of things, memory and processing power are really cheap and fast, and disk is really slow. Any time you can reduce the volume of data you need to read off physical disk, you will be better off. The rest of it doesn't really matter.
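In SQL Server, for what it's worth, that versioned-read behaviour is opt-in rather than the default locking behaviour; if you want readers to see the pre-update row image it gets switched on at the database level, roughly like this (database name is an example):

    -- Enable row-versioning reads so readers see the row as it was before the update
    ALTER DATABASE MyDW SET ALLOW_SNAPSHOT_ISOLATION ON;
    ALTER DATABASE MyDW SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;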
Re: SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
Then how does Oracle's VARCHAR2 field differ from VARCHAR?
I agree that pulling data off the disk as quickly as possible is the key. It was just my understanding that updates to VARCHAR fields could require the head to read multiple places on the disk. If that is true, then a dimension table with a lot of Type 1 fields that get updated relatively frequently would actually be faster with CHAR fields than with VARCHAR fields. It was my understanding that this is the reason that searches and joins on CHAR fields were so much more efficient than on VARCHAR fields.
I am far from an expert on data types. I pick up things along the way, and no doubt some of it was incorrect. But usually, if something new makes logical sense to me, then I am more than willing to adopt it.
Jeff Smith- Posts : 471
Join date : 2009-02-03
Re: SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
Not sure what the question "Then how does Oracle's VARCHAR2 field differ from VARCHAR?" is asking... the "2" doesn't mean a whole lot... it may have meant something in Oracle's early days when, I guess, VARCHAR was a data type, but for the last 20 years or so, as far as I know, it has been VARCHAR2. Whether you use CHAR (fixed length) or VARCHAR2 (variable length) in Oracle, it doesn't change how it handles updates.
Besides, updates are to rows, not individual columns. The row doesn't get fractured; the entire row stays intact, it is just placed elsewhere in the tablespace. Where a particular row is, in and of itself, doesn't mean anything. What matters is block density... particularly with table scans... which is a measure of the number of valid rows retrieved in a block read. If a table has heavy update activity, most DBs have utilities to compress and reclaim 'dead' space. And if the query uses indexes, density doesn't matter much, since it is going all over the place to get the rows anyway.
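In SQL Server terms, the equivalent of those utilities is a rebuild; a quick way to watch the dead space come back is something like this (table name is just an example):

    -- Space used before reclaiming (reserved vs data shows the 'dead' space)
    EXEC sp_spaceused 'dbo.DimCustomer';

    -- Rebuild the table to repack pages and reclaim space left by heavy updates
    ALTER TABLE dbo.DimCustomer REBUILD;

    -- Space used after the rebuild
    EXEC sp_spaceused 'dbo.DimCustomer';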
Re: SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
My understanding of Oracle's VARCHAR and VARCHAR2 was that VARCHAR2 did not retain the trailing spaces as the CHAR datatype would, but reserved the entire space like the CHAR datatype would. Based on what you are saying, the VARCHAR field would be preferable to the VARCHAR2 field.
And do all databases act the way you describe?
Jeff Smith- Posts : 471
Join date : 2009-02-03
Re: SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
Jeff,
I think you’re referring to page splits. Off the top of my head, it won’t cause any issues if a page split occurs, because the index is automatically updated to point to the correct data page.
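If page splits ever did become a concern, the usual knob is the index fill factor, e.g. something like this (index and table names are illustrative):

    -- Leave 10% free space per page so widening updates split pages less often
    ALTER INDEX PK_DimCustomer ON dbo.DimCustomer
        REBUILD WITH (FILLFACTOR = 90);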
Re: SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
I think I have it - updates are really deletes and inserts, and therefore allocated space doesn't play a role, at least in SQL Server 2K8.
Try to use VARCHAR fields whenever the field is more than 2 bytes. Never use a VARCHAR(1), as it uses more space than a CHAR(1). If you know that the values in a field have a uniform size (such as a state abbreviation or zip code), use the CHAR datatype, unless the field is populated unevenly.
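A quick way to see the overhead that 2-byte rule of thumb refers to (table names and the row count are just for illustration):

    CREATE TABLE dbo.Flag_char    (Flag CHAR(1)    NOT NULL);
    CREATE TABLE dbo.Flag_varchar (Flag VARCHAR(1) NOT NULL);

    -- Insert the same 100,000 single-character values into each table
    INSERT INTO dbo.Flag_char (Flag)
    SELECT TOP (100000) 'Y' FROM sys.all_columns a CROSS JOIN sys.all_columns b;
    INSERT INTO dbo.Flag_varchar (Flag)
    SELECT TOP (100000) 'Y' FROM sys.all_columns a CROSS JOIN sys.all_columns b;

    -- The varchar table comes out larger: each varchar value carries a 2-byte
    -- entry in the row's variable-length column offset array on top of the data
    EXEC sp_spaceused 'dbo.Flag_char';
    EXEC sp_spaceused 'dbo.Flag_varchar';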
Jeff Smith- Posts : 471
Join date : 2009-02-03
Re: SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
I'd like to step back from the detail and offer a high-level perspective relating to the start of this thread:
I've just been working on upgrading a small DW/Data Mart (under 20GB) from SQL 2005 to 2008. I switched on compression after initially stabilising the migrated application and running it for a while, so I have been able to compare like for like.
On the SQL 2008 environment, comparing pre- and post-compression, data volume reduced as expected. Interestingly, ETL (SSIS) elapsed time and cube build time also both reduced by 5-10%. I was expecting these to increase with compression, especially as the server is CPU-bound with only a single-core VM under the hood. I suspect this is the benefit of lower memory use and faster IO for queries within the DW, e.g. looking up surrogate keys, SCD logic, deriving aggregates, etc.
BTW I'm compressing all tables and indexes in PAGE mode.
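To save typing the statements out per table, I generate them from the catalog views, roughly like this (a sketch; you then run the statements it prints):

    -- Generate PAGE-compression rebuild statements for every user table
    SELECT 'ALTER TABLE ' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name)
           + ' REBUILD WITH (DATA_COMPRESSION = PAGE);'
    FROM sys.tables t
    JOIN sys.schemas s ON s.schema_id = t.schema_id;

    -- And the same for all of their indexes
    SELECT 'ALTER INDEX ALL ON ' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name)
           + ' REBUILD WITH (DATA_COMPRESSION = PAGE);'
    FROM sys.tables t
    JOIN sys.schemas s ON s.schema_id = t.schema_id;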
So for now, I'm thinking Compression is all good, and I'm planning to roll it out for my other clients with similar apps.
I'm interested in others' experiences in this area.
Mike
Re: SQL Server 2008 - CHAR and table compression vs VARCHAR in Dimension Tables
On a project I was associated with that had large volumes of data (a fact table of over 5 billion rows), using compression with SQL Server 2008 gave reductions of about 40% in both disk space and ETL processing times.