During my session CSI:DW – Anatomy of a VLDW at SQL Saturday #402 in Indianapolis (and I’ll be repeating that session at SQL Saturday #403 in Louisville in less than two weeks), I talked about managing schema changes for very large data warehouse tables… So I thought I’d blog about the solution we came up with at the VA for handling schema changes.
Imagine that you have a large data warehouse environment. Also imagine that you have a fact table with 2B+ rows in it. And that you have several hundred (if not thousands) of users who query that fact table in various ways every day. As luck would have it, you need to change the schema of the table. Maybe you need to add a column (which is the most common scenario in our case at the VA). Or maybe you need to change a column’s data type (hopefully that wouldn’t happen, but…) How do you handle that for a table that large? Perhaps you could manage to add a column by using an ALTER TABLE statement, but then you’d have to deal with updating all 2B+ rows to get that column filled in. And perhaps you could handle a data type change by adding a new column with the new data type (albeit also with a new column name). In most cases, anything you do to the table will likely impact users in one way or another. And we all know that users like to complain when they can’t get their data as expected when expected!
After dealing with this issue a few times, the DW team at the VA came up with table versioning as a standard methodology for handling any and all schema change scenarios. There are a couple of key concepts that are involved with this solution:
- Every table name has a version number appended to the table name. Thus, we have physical table names like [Dim].[Location_v025] and [Appt].[Appointment_v062].
- All user access to data warehouse tables is managed via views and the view names do NOT reflect version numbers. So, we have view names like [Dim].[Location] and [Appt].[Appointment] and users are only able to select data from the views.
- Surrogate keys are assigned to each table based on a separate DW surrogate ID table that is paired up with each table. These tables all exist in a [DWSID] database schema. For example, for the [Dim].[Location_v025] table, we have a table named [DWSID].[Location]. This table contains a surrogate key column, implemented as an IDENTITY() column, along with the business key(s) for that table. The versioned table also includes the surrogate key column and business keys, but the surrogate key column is NOT marked as an IDENTITY() column in the versioned table.
With these concepts in place, we can now create a new version of a table independent of the existing version. Assuming the location dimension has to be changed in some way, our data modeling team will model the new table and assign it a new version number (which is actually handled automatically by their modeling tool, ER/Studio). For example, they might end up creating [Dim].[Location_v059] as a result of the new modeling effort. The new table can be created alongside the existing versioned table, with no ill effects to the existing version or user access to the existing version (since the [Dim].[Location] view now gets data from [Dim].[Location_v025]. The ETL developers will create a new load process (we use SSIS packages for all of the ETL work at the VA) for loading the new version of the table and execute it (typically via scheduled job) to get the new table loaded. Meanwhile, the existing load process for the existing version of the table can continue to run. Both processes (SSIS packages) can and do insert records into the [DWSID].[Location] table as needed when one or the other encounters a new business key (i.e., a new source record) in the source system(s). Each process can then correctly add that same record to each version of the table – and can assign the right (and the same) surrogate key based on a lookup against the [DWSID].[Location] table using the business key(s).
With both tables populated, the ETL developers (and others) can then take a look at the new table to verify its new schema, its data, etc. In fact, both tables can remain “in production” for several days or even weeks as needed to ensure the ETL process and the data is right. Once the DW team is ready to switch users to the new version, ALTER VIEW statements can be used to change the view to reflect the new table version and its schema change. In the example above, the [Dim].[Location] view can be changed so that it gets its data from the new [Dim].[Location_v059] table instead of the old [Dim].[Location_v025] table. When complete, users will see the schema changes and the old table can be deprecated and dropped from the database.
Of course, nothing comes for free. Using table versioning does imply that we have enough disk space available to create a full clone of a table – including additional space for any new columns, any larger data types used, any additional indexes planned, etc. Typically this isn’t a problem but there have been times when we’ve had to add additional space before generating a new version of some of our largest tables. The good news is that space can be reclaimed (if needed) once the old version of a table is deprecated.
Although table versioning may sound complicated, it has become part of our overall DW architecture at the VA and has worked out well over the years (we’ve been following this approach for almost 5 years now). Most DW users are not aware of the technical details, but do appreciate the fact that we can announce schema changes and then (typically) enact those changes on an exact date as announced (since the only change required to “publish” a new version of a table once it is ready is the DDL statement used to alter the view)…
I’d be interested to hear what other data warehouse professionals think of this idea, so leave a comment here if you want or drop me a note at firstname.lastname@example.org.