Friday, August 28, 2015

The “Case” of the Failing SSIS Lookup

I recently ran across an issue that reminded me of an often overlooked “feature” of the SSIS lookup task.  Before jumping to the punch line, let me set the stage…

I was working on a very simple SSIS package.  Basically, it pulled some data from an operational system, landed that data into a staging table, and then used a lookup task to determine whether to insert or update specific records in a final destination table.  There are lots of ways to do that, but in this case, I was using a lookup task to bounce the records off the destination table.  If a given record had a lookup match, it became an update; otherwise, it became an insert.  After testing the package several times, I deployed it to production, where it ran successfully.  Most of the time.  Well, about half of the time.  The other half of the time, the package would fail with error messages indicating that inserts were failing due to duplicate key violations.  The destination table had a unique constraint (index) in place on its set of business keys – and the package was failing when it attempted to insert a business key that already existed…

Of course, at first glance, this didn’t make much sense.  You see, the lookup task was using the same set of business keys, so any record flowing through the package and through the lookup *should* have been found if the given set of business key values existed in the destination table.  And any records with a successful lookup match should have been routed to the update part of the data flow (which was actually just landing those records into another staging table that was used after the data flow to do a set-based update).  So, how was it that sometimes records would fail to be found by the lookup task, flow to the insert part of the data flow, and then fail to be inserted into the table because of unique constraint violations?  I’ll admit that at first, I was stumped.  I removed the mappings in the lookup task and reset them, thinking maybe that would help.  Nope.  I deleted the lookup task and recreated it, thinking maybe that would help.  Nope.

Okay, time for some serious troubleshooting!  I created another (temporary) staging table and reconfigured the insert part of the data flow to insert the records into that new staging table instead of the real destination table.  And ran the package a few times.  After each execution, I examined both the new staging table and the “updates” staging table to see where the lookup had routed all of the incoming rows.  Since the package was pulling from a live operational system, it was hard to get the same set of records each time (which made the troubleshooting a bit harder), but I was able to see that, indeed, the lookup would sometimes route “existing” records to the new staging table that represented the rows that had failed the lookup.  When I manually joined that new staging table to the destination table based on the business keys, the join verified that some of those records did exist in the destination table even though the lookup thought they did not.
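If you ever need to run the same sanity check, the query looked roughly like this (the table and column names here are made up for illustration, not the actual objects).  If the lookup were behaving the way I expected, this join should return zero rows:

    -- [stg].[LookupMisses] holds the rows the lookup routed to the "insert" path;
    -- [dbo].[Destination] is the real destination table (both names hypothetical).
    -- A correct lookup match on the business keys would mean zero rows come back here.
    SELECT m.BusinessKey1, m.BusinessKey2, d.BusinessKey1, d.BusinessKey2
    FROM   [stg].[LookupMisses] AS m
           INNER JOIN [dbo].[Destination] AS d
               ON  d.BusinessKey1 = m.BusinessKey1
               AND d.BusinessKey2 = m.BusinessKey2;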

So, what was going on here?  After scratching my head a few times, I reached out to a good friend of mine, Tom Huneke from Key2 Consulting.  Tom and I work together as consulting resources for the Department of Veterans Affairs.  After we went through the package and looked at the results captured in the two staging tables, Tom immediately saw the problem – the records that the lookup task failed to find (and therefore routed to the new staging table that represented inserts) had the same business key values as existing records in the destination table – except for the “case” of the data.  For example, a record in the destination table had a value of “TBICUBE” in one of the business key columns, while the record that had “failed” the lookup and was flagged as a new record had the value “TBICube”.  I had forgotten that although your source and destination systems (in this case, both SQL Server) might be configured to be case-insensitive, SSIS and its various tasks compare string data in a case-sensitive way.  While I could successfully join the records with “TBICUBE” and “TBICube” as values using T-SQL (where those values were considered equivalent), SSIS and its lookup task didn’t see it that way.
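A quick way to see the difference for yourself, assuming your database uses a typical case-insensitive collation (such as SQL_Latin1_General_CP1_CI_AS):

    -- Under a case-insensitive collation, T-SQL treats these two values as equal...
    SELECT CASE WHEN 'TBICUBE' = 'TBICube'
                THEN 'match' ELSE 'no match' END AS CaseInsensitiveCompare;

    -- ...but force a case-sensitive comparison (which is how the SSIS lookup compares
    -- its cached reference data) and they are no longer the same value.
    SELECT CASE WHEN 'TBICUBE' = 'TBICube' COLLATE Latin1_General_CS_AS
                THEN 'match' ELSE 'no match' END AS CaseSensitiveCompare;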

The quick fix was to change the lookup task to use a query that forced the business key values to lower case, and to add a derived column transformation before the lookup to create lower-case versions of the incoming data columns.  Configuring the lookup task so that it compared lower-case values on both sides solved the problem.  As to why the package was succeeding sometimes and failing sometimes, a bit more research into the source system revealed that it sometimes recorded values (in the example above, the “TBICube” value) using all upper case, sometimes using mixed case, and sometimes using all lower case.  Don’t get me started on why it might have been doing that – it is what it is in this case and will likely never change.  As a result, the values recorded in the source records would sometimes match the destination table records exactly (case and all, which is what the lookup task’s case-sensitive matching required), and sometimes they wouldn’t.
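For the curious, the fix looked roughly like this – again with placeholder table and column names, not the actual objects.  The lookup’s reference query lower-cases the business keys on the destination side, and a Derived Column transformation does the same to the incoming rows, so both sides of the comparison use one consistent case:

    -- Lookup task reference query: build the lookup cache from lower-cased business keys.
    SELECT LOWER(BusinessKey1) AS BusinessKey1,
           LOWER(BusinessKey2) AS BusinessKey2,
           DestinationKey
    FROM   [dbo].[Destination];

    -- Incoming side (Derived Column transformation, SSIS expression language):
    --     BusinessKey1_Lower = LOWER(BusinessKey1)
    --     BusinessKey2_Lower = LOWER(BusinessKey2)
    -- The lookup is then mapped on the lower-cased columns from both sides.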

In the end, SSIS and its lookup task were operating exactly as designed – and exactly as documented.  Sometimes even those of us who have been creating SSIS solutions for more years than we care to admit will forget about something as simple as this – and spend way too much time troubleshooting the problem while overlooking the obvious!

Thursday, August 27, 2015

Why We Speak and Volunteer for SQL Saturday

Last weekend, I had the pleasure of both speaking and volunteering for SQL Saturday #403 in Louisville.  What a blast!  From feedback that we’ve received, it seems to have been our most successful event to date in Louisville.  We maxed out our registration limit of 250 (a great crowd for our city), had great feedback from our sponsors, and generally seemed to exceed expectations of the attendees in terms of the sessions we offered, the speakers we had, the food we brought in, etc.  Just a great event all around!!

After the event, I got an awesome email from my good friend Chris Yates (blog | twitter | linkedin), fellow volunteer for the event and all around great guy.  Here’s the text of that email (posted with his permission):

A BI developer came up to me as I was leaving yesterday. I quote verbatim:

“Mr. Yates I owe you an apology”

I ask why?

“I’ve been doing things wrong for the last year and I didn’t realize it nor the trouble I was causing you. I’m going to work toward fixing my processes”

I ask what brought this on?

“I sat in a guy’s class named Dave Fackler at SQL Saturday, and I understand now why things have not been working the way they should”

BOOOOM ~ and you tell me what we do doesn’t pay off! Good job homie, and thanks for making my life easier here :-)

Talk about a great feeling!  Yes, my head did swell a bit and my chest did stick out a bit farther for the rest of the day!  But the point here is really the last sentence in that email from Chris – that we do what we do (in this case, volunteering to speak and volunteering to make events like this happen) in order to make a difference.  Maybe we’re able to make a difference simply because someone learns something, or maybe we’re able to make a difference in how someone does their job.  And maybe either of those will make a bigger difference to the company they work for – either now or down the road.  Sure, it’s fun to be a speaker and to have a room full of people hanging on your every word (while hoping beyond hope that your demos go well and that you don’t say something stupid!).  But the reason most of us speak is to try to make a difference…

And events like this (along with other PASS events, local PASS chapter meetings, etc.) just wouldn’t happen without the volunteers who spend countless hours making it all work out.  Most people don’t understand what goes into making a great event happen until they volunteer – at least I didn’t.  And sometimes being a volunteer can be aggravating – but in the end, it is almost always worth it.  To see the excitement of the attendees on the day of the event, to see the smiles of the speakers who appreciate the work that has gone into the event, and to hear the buzz as sponsors engage with people throughout the day – just WOW.  It is all worth it in the end…

So, what have you volunteered for lately?  If you’re passionate about SQL Server (and I assume the few people who might actually read this are), then there’s no better way to help your community of fellow SQL Server professionals (affectionately referred to on Twitter as #sqlfamily) than to volunteer – get involved with your local PASS chapter, help out with the next SQL Saturday event in your city, find out how to get engaged as a volunteer at the next PASS Summit.  In the end, you’ll love it and you’ll think “Why didn’t I do this sooner??”

Monday, August 10, 2015

Table Versioning for Schema Changes

During my session CSI:DW – Anatomy of a VLDW at SQL Saturday #402 in Indianapolis (and I’ll be repeating that session at SQL Saturday #403 in Louisville in less than two weeks), I talked about managing schema changes for very large data warehouse tables…  So I thought I’d blog about the solution we came up with at the VA for handling schema changes.

Imagine that you have a large data warehouse environment.  Also imagine that you have a fact table with 2B+ rows in it.  And that you have several hundred (if not thousands) of users who query that fact table in various ways every day.  As luck would have it, you need to change the schema of the table.  Maybe you need to add a column (which is the most common scenario in our case at the VA).  Or maybe you need to change a column’s data type (hopefully that wouldn’t happen, but…)  How do you handle that for a table that large?  Perhaps you could manage to add a column by using an ALTER TABLE statement, but then you’d have to deal with updating all 2B+ rows to get that column filled in.  And perhaps you could handle a data type change by adding a new column with the new data type (albeit also with a new column name).  In most cases, anything you do to the table will likely impact users in one way or another.  And we all know that users like to complain when they can’t get their data as expected when expected!

After dealing with this issue a few times, the DW team at the VA came up with table versioning as a standard methodology for handling any and all schema change scenarios.  There are a couple of key concepts that are involved with this solution:

  • Every table name has a version number appended to the table name.  Thus, we have physical table names like [Dim].[Location_v025] and [Appt].[Appointment_v062].
  • All user access to data warehouse tables is managed via views and the view names do NOT reflect version numbers.  So, we have view names like [Dim].[Location] and [Appt].[Appointment] and users are only able to select data from the views.
  • Surrogate keys are assigned using a separate DW surrogate ID table that is paired with each versioned table.  These tables all exist in a [DWSID] database schema.  For example, for the [Dim].[Location_v025] table, we have a table named [DWSID].[Location].  This table contains a surrogate key column, implemented as an IDENTITY() column, along with the business key(s) for that table.  The versioned table also includes the surrogate key column and business keys, but the surrogate key column is NOT marked as an IDENTITY() column there.  (A simplified sketch of all three objects follows this list.)
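To make the pattern concrete, here is a stripped-down sketch of the three objects involved.  The column names are made up for illustration (and the sketch assumes the [Dim] and [DWSID] schemas already exist):

    -- Surrogate ID table: owns the IDENTITY column and the business key(s).
    CREATE TABLE [DWSID].[Location]
    (
        LocationSID  int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        LocationCode varchar(50)       NOT NULL   -- business key
    );
    GO

    -- Versioned physical table: carries the same surrogate key, but NOT as an IDENTITY.
    CREATE TABLE [Dim].[Location_v025]
    (
        LocationSID  int          NOT NULL,
        LocationCode varchar(50)  NOT NULL,
        LocationName varchar(100) NULL
        -- ...remaining attributes...
    );
    GO

    -- User-facing view: no version number in the name; users only query the view.
    CREATE VIEW [Dim].[Location]
    AS
    SELECT LocationSID, LocationCode, LocationName
    FROM   [Dim].[Location_v025];
    GO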

With these concepts in place, we can now create a new version of a table independent of the existing version.  Assuming the location dimension has to be changed in some way, our data modeling team will model the new table and assign it a new version number (which is actually handled automatically by their modeling tool, ER/Studio).  For example, they might end up creating [Dim].[Location_v059] as a result of the new modeling effort.  The new table can be created alongside the existing versioned table, with no ill effects on the existing version or on user access to it (since the [Dim].[Location] view still gets its data from [Dim].[Location_v025]).  The ETL developers will create a new load process (we use SSIS packages for all of the ETL work at the VA) for loading the new version of the table and execute it (typically via a scheduled job) to get the new table loaded.  Meanwhile, the existing load process for the existing version of the table can continue to run.  Both processes (SSIS packages) can and do insert records into the [DWSID].[Location] table as needed when one or the other encounters a new business key (i.e., a new source record) in the source system(s).  Each process can then correctly add that same record to each version of the table – and can assign the right (and the same) surrogate key based on a lookup against the [DWSID].[Location] table using the business key(s).
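The actual work happens inside the SSIS packages, but in rough T-SQL terms (continuing the made-up names from the sketch above, with a hypothetical [Stage].[Location] staging table), the surrogate key handling in each load process boils down to two steps:

    -- Step 1: register any business keys not yet known; the IDENTITY column in
    --         [DWSID].[Location] hands out the new surrogate key values.
    INSERT INTO [DWSID].[Location] (LocationCode)
    SELECT DISTINCT s.LocationCode
    FROM   [Stage].[Location] AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   [DWSID].[Location] AS d
                       WHERE  d.LocationCode = s.LocationCode);

    -- Step 2: each versioned table's load looks up the surrogate key by business key,
    --         so both versions receive the same LocationSID for the same source record.
    SELECT d.LocationSID, s.*
    FROM   [Stage].[Location] AS s
           INNER JOIN [DWSID].[Location] AS d
               ON d.LocationCode = s.LocationCode;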

With both tables populated, the ETL developers (and others) can then take a look at the new table to verify its new schema, its data, etc.  In fact, both tables can remain “in production” for several days or even weeks as needed to ensure the ETL process and the data is right.  Once the DW team is ready to switch users to the new version, ALTER VIEW statements can be used to change the view to reflect the new table version and its schema change.  In the example above, the [Dim].[Location] view can be changed so that it gets its data from the new [Dim].[Location_v059] table instead of the old [Dim].[Location_v025] table.  When complete, users will see the schema changes and the old table can be deprecated and dropped from the database.
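Continuing the hypothetical example, the cutover itself is a single, near-instant DDL change:

    -- Repoint the user-facing view at the new version of the table.
    ALTER VIEW [Dim].[Location]
    AS
    SELECT LocationSID, LocationCode, LocationName   -- plus any newly added columns
    FROM   [Dim].[Location_v059];
    GO

    -- Once everyone is on the new version, the old table can eventually be dropped:
    -- DROP TABLE [Dim].[Location_v025];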

Of course, nothing comes for free.  Using table versioning does imply that we have enough disk space available to create a full clone of a table – including additional space for any new columns, any larger data types used, any additional indexes planned, etc.  Typically this isn’t a problem, but there have been times when we’ve had to add additional space before generating a new version of some of our largest tables.  The good news is that space can be reclaimed (if needed) once the old version of a table is deprecated.

Although table versioning may sound complicated, it has become part of our overall DW architecture at the VA and has worked out well over the years (we’ve been following this approach for almost 5 years now).  Most DW users are not aware of the technical details, but do appreciate the fact that we can announce schema changes and then (typically) enact those changes on an exact date as announced (since the only change required to “publish” a new version of a table once it is ready is the DDL statement used to alter the view)…

I’d be interested to hear what other data warehouse professionals think of this idea, so leave a comment here if you want or drop me a note at

Saturday, August 8, 2015

Hello Blog, It’s Been a While…

Wow, it’s been a long time since I’ve created a blog post!  The last one was back in September of 2012…  Almost 3 years ago!  During the last several years, I’ve been heads down working on the data warehouse and BI environment at the Department of Veterans Affairs (VA), which has kept me very, very busy.  Of course, that’s really no excuse for not blogging.  So, I’ve decided it’s time to start sharing things we’ve done over the last several years at the VA and to try to be a more active blogger (and twitterer – or whatever the right term is for that).  I’ve recently changed my twitter handle to @sqlbiguru, so look me up there if you want…  Cheers!

Monday, September 10, 2012

My Favorite SSIS Tips and Tricks–Tip #4

My next tip is related to a post that I created back in February 2008 (more on that in a minute).  This tip has to do with Connection Manager names.  When you create a new Connection Manager, the resulting object is typically named with two parts – the server name and the database name.  So, for example, a Connection Manager that connects to the FinMart database on a server named DevTeam01 will be named DevTeam01.FinMart, as shown below:


Now, this might seem fine – until you get ready to move this package to a different server for testing, QA, or even production use.  At that point, the Connection Manager name might be very misleading, since it may no longer connect to the DevTeam01 server.  So, the first part of this tip is to remove the server name from your Connection Manager names…

Now, back to that February 2008 post.  That post recommended that you always create two Connection Managers for use with SQL Server databases.  The first uses an OLE DB provider (either the SQL Server Native Client or the Microsoft OLE DB Provider for SQL Server) while the second uses an ADO.Net provider (the SqlClient Data Provider).  In the post, I discussed using the first Connection Manager (based on the OLE DB provider) within Data Flow tasks and using the second Connection Manager (based on the ADO.Net provider) for Execute SQL tasks (since it is easier to work with parameterized queries, like calls to stored procedures).  I still stand by that tip and continue to do that in just about every package I develop to this day (perhaps that will change in SSIS 2012 – I just haven’t made that leap yet with my current project).
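To illustrate why the ADO.Net Connection Manager is nicer for Execute SQL tasks: with an OLE DB connection the task uses positional “?” markers mapped by ordinal, while with an ADO.Net connection you can use named parameters.  The stored procedure and parameter names below are hypothetical:

    -- Execute SQL task with an OLE DB connection manager: positional markers,
    -- mapped in the task's parameter mapping by ordinal (0, 1, ...).
    EXEC dbo.usp_LoadFinMart ?, ?

    -- Execute SQL task with an ADO.Net (SqlClient) connection manager: named
    -- parameters, which are easier to read and to map correctly.
    EXEC dbo.usp_LoadFinMart @BatchDate, @SourceSystem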

So, in keeping with the naming aspect of this tip, I suggest naming your Connection Managers using two-part names – the first part reflects the database name and the second part reflects the provider being used.  Thus, in the example above, if the original Connection Manager was based on an OLE DB provider, I would name it FinMart.OLEDB.  And then I would add a second Connection Manager to the same database using an ADO.Net provider and name it FinMart.ADONET.  So my package would have these two (at least) Connection Managers:


By having both Connection Managers in place, I can easily know which one to use for which task and have the best of both worlds (working in Data Flows and with the Execute SQL task…)

Happy SSIS’ing!!