Sunday, February 24, 2008

Incorrect Truncation Warning in SSIS

I was working on an SSIS package a few weeks back and was puzzled by a truncation warning that two of the Data Flow tasks kept throwing out.  In the Data Flow, a string-based column of length 10 in the pipeline was being used as part of the input for an OLE DB Destination, which was storing the data into a SQL Server table.  The column being populated by this specific input column had a data type of varchar(20).  No problem, right?

Well, the Data Flow task was generating a warning stating that the column of length 20 might be truncated when used to populate the column of length 10.  Huh?  Exactly what I thought.  Why was the Data Flow task complaining when the column of length 20 was clearly the destination and the column of length 10 was the source?

I reviewed every property I could think of with regard to the source column, the destination column, the destination table, the OLE DB Destination component, the Data Flow task, etc.  Nothing I could find (at first glance) seemed out of place.  And even though the package ran fine (it was, after all, just a warning), it continued to bug me that something about the package was causing the SSIS engine to generate the warning.

After scratching my head for quite some time, I finally found the answer.  It turned out that someone working on the package prior to my development work had set the ValidateExternalMetadata property on the OLE DB Destination to False.  This is typically done (in my experience) to keep a Data Flow task from complaining about the metadata of a destination table that doesn't exist yet (maybe it will be created in a step before the Data Flow task executes, for example).  Well, believe it or not, setting this property to False seemed to cause the SSIS engine to misinterpret the metadata about the source and destination, in this case swapping the column metadata.

Here's a quick example to see this in action.  Create a Data Flow task in a package, then add an OLE DB Source component, a Derived Column component, and an OLE DB Destination component.  For the OLE DB Source component, select a table from the AdventureWorks database (like the Production.Product table) and pick a character-based column.  Connect the OLE DB Source component to the Derived Column component.  In the Derived Column component, create a new column (with a new name) and use the SUBSTRING() function to make a shorter version of the string column.  Finally, send the output of the Derived Column component to the OLE DB Destination, point the destination at the original table, and map the shorter column you created to the column you originally picked from the table.  You end up with something like this:

Simple Data Flow

In the OLE DB Source, I selected the Production.Product table and selected the Name field, which has a data type of nvarchar(50).

OLE DB Source Component
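For reference, if you'd rather use a SQL command in the source than pick the table from the drop-down, the source amounts to nothing more than this:

  SELECT Name
  FROM Production.Product;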

In the Derived Column component, I created a new column named ShorterName as follows:

Derived Column Component
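The exact expression doesn't matter much; something along these lines does the trick (40 is just the shorter length used here, and the explicit cast, while optional, forces the new column's metadata to DT_WSTR with a length of 40, which matches the length reported in the warning below):

  ShorterName = (DT_WSTR, 40)SUBSTRING(Name, 1, 40)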

Finally, for the OLE DB Destination, I mapped the new ShorterName column back to the Name column:

OLE DB Destination

Now, there's no need to actually run the package to see the warning.  Simply change the ValidateExternalMetadata property for the OLE DB Destination component from its default of True to False.

OLE DB Destination Properties

As soon as you do this, you'll see a warning icon on the OLE DB Destination component.  If you look at the warning in the Error List, you'll see something like the following:

Validation warning. Data Flow Task: OLE DB Destination [140]: Truncation may occur due to inserting data from data flow column "Name" with a length of 50 to database column "ShorterName" with a length of 40.

Note that the truncation warning has inverted the metadata involved, stating that (in this case) truncation may occur when inserting data from the Name column into the database column ShorterName.  But, as is clearly the case, the database column (the destination) is the Name column, while the ShorterName column is actually the source.  If you set the ValidateExternalMetadata property back to True, the truncation warning goes away.

Strange behavior.  Again, just a warning that doesn't keep a given data flow from working properly.  But strange nonetheless.  I've posted this as a bug on Connect; if you come across it, feel free to comment or validate it.

http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=327577

Friday, February 22, 2008

Using Connection Managers in SSIS

Whenever I create SSIS packages, I typically create two different Connection Managers for use against SQL Server databases. The first uses a standard OLE DB provider, like this one to the AdventureWorks database (which I typically name something like AdventureWorks.OLEDB):

AdventureWorks.OLEDB

The second one, to the same database, uses an ADO.NET provider (which I typically name something like AdventureWorks.ADONET):

AdventureWorks.ADONET
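Under the covers, the two Connection Managers point at the same server and database and differ only in the provider being used.  The connection strings come out roughly like this (server name made up, obviously); the first uses the SQL Native Client OLE DB provider and the second uses the .NET SqlClient provider:

  Data Source=MYSERVER;Initial Catalog=AdventureWorks;Provider=SQLNCLI.1;Integrated Security=SSPI;
  Data Source=MYSERVER;Initial Catalog=AdventureWorks;Integrated Security=True;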

Now, here's why I set up these two different Connection Managers:

The OLE DB provider seems to be faster than the ADO.NET provider for getting data into and out of a SQL Server database.  So, in Data Flow tasks within the package, I use the first Connection Manager for OLE DB Source and OLE DB Destination components when hitting the database.  For example:

Data Flow Task

However, as anyone who has used the Execute SQL task along with an OLE DB-based Connection Manager will tell you, dealing with stored procedures and their parameters isn't the most elegant experience.  If the Execute SQL task uses an OLE DB-based Connection Manager, then calls to stored procedures look like this:

Stored Procedure with Positional Parameters
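In other words, the SQLStatement has to use question marks as positional parameter markers.  For a hypothetical procedure that takes two parameters, the statement looks something like this:

  EXEC dbo.usp_LoadDimCustomer ?, ?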

In the Parameter Mapping tab of the task, the parameters then have to be specified using positional numbers, starting with 0.  So, for the example above, you'd have to set up mappings to parameters as follows:

Positional Parameter Names
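Sticking with that hypothetical procedure, the mappings end up looking something like this, with the Parameter Name values being nothing more than ordinal positions:

  Variable Name        Direction   Data Type   Parameter Name
  User::LoadDate       Input       DATE        0
  User::BatchId        Input       LONG        1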

If, on the other hand, you use an ADO.NET-based Connection Manager with the Execute SQL Task, things are a bit more elegant. In this case, you can simply specify the name of the stored procedure without any positional parameter markers:

Stored Procedure Name Only

Note the property called IsQueryStoredProcedure below the name of the stored procedure. When an OLE DB-based Connection Manager is used (as in the first example above), this property is set to False and is grayed out. Now, however, with an ADO.NET-based Connection Manager, the property can be set to True whenever a stored procedure name is provided. As for the parameters, they can now be mapped using the actual parameter names within the stored procedure:

Actual Parameter Names
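For that same hypothetical procedure, the configuration boils down to this: the SQLStatement is just the procedure name, IsQueryStoredProcedure is set to True, and the mappings use the procedure's actual parameter names:

  SQLStatement:            dbo.usp_LoadDimCustomer
  IsQueryStoredProcedure:  True

  Variable Name        Direction   Data Type   Parameter Name
  User::LoadDate       Input       DateTime    @LoadDate
  User::BatchId        Input       Int32       @BatchId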

To me, this is much more intuitive and elegant.  And, when you build lots of SSIS packages (for an ETL process, for example), you almost always have more than a few stored procedure calls that need to be made before Data Flow tasks, after them, in Event Handlers, etc.  So, although it seems somewhat silly to have two Connection Managers to the same database, using one for each of the different providers for SQL Server allows you to have the best of both worlds -- good performance for getting data into and out of a database, and good (or at least better) stored procedure handling for parameter names and mapping.

Thursday, February 21, 2008

Using Schemas to Organize Data Warehouse Objects

Over the last several years, I've adopted the practice of using schemas to organize data warehouse objects in the SQL Server database created as part of a BI solution.  I've found this is a good way to differentiate the objects based on their purpose.  And it's freed me from having to use suffixes or prefixes as I'd been doing for several years.

For example, in my early days of designing BI solutions, I'd often use table names like DimDate, DimProduct, and the always funny DimCustomer and DimEmployee for dimension tables. And things like FactSales, FactInventory, and FactCalls for fact tables. This organized things fairly well, but always felt a bit artificial.

With the introduction of schemas to SQL Server databases, however, I've standardized on a set of schemas for every data warehouse database I design. The tables above would then be named Dim.Date, Dim.Product, Dim.Customer, and Dim.Employee (as tables built within the Dim schema) along with Fact.Sales, Fact.Inventory, and Fact.Calls (as tables built within the Fact schema). Likewise, if I have views, stored procedures, functions, etc. that are related to the dimensions, they get created in the Dim schema while those related to the fact tables get created in the Fact schema.

Overall, I generally create the following schemas right out of the gate:

  • Dim -- used for all objects related to the dimensions in the model
  • Fact -- used for all objects related to the facts in the model
  • Audit -- used to hold objects related to auditing, generally a feature of the ETL process that loads the dimensions and facts
  • ETL -- used to hold objects directly related to the ETL process (configuration values, processing directives, etc.)
  • Stage -- used to hold all objects needed for staging data during the ETL process
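Setting these up is nothing more than a few CREATE SCHEMA statements when the database is first created, after which objects simply get built in the appropriate schema.  A minimal sketch (with the table definitions obviously trimmed way down):

  CREATE SCHEMA Dim AUTHORIZATION dbo;
  GO
  CREATE SCHEMA Fact AUTHORIZATION dbo;
  GO
  CREATE SCHEMA Stage AUTHORIZATION dbo;
  GO

  CREATE TABLE Dim.Product
  (
      ProductKey   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
      ProductName  nvarchar(50) NOT NULL
  );
  GO

  CREATE TABLE Fact.Sales
  (
      DateKey      int NOT NULL,
      ProductKey   int NOT NULL REFERENCES Dim.Product (ProductKey),
      SalesAmount  money NOT NULL
  );
  GO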

In some cases, I've used other schemas to hold objects specific to a given solution or implementation. For example, I've been known to create the following:

  • History -- used to hold objects related to storing historical copies of data (for example, when you update a dimension record and need the pre-update version of the record just in case)
  • Report -- used to hold objects specifically used by reporting applications (such as views and stored procedures used directly by SSRS)

And while I know that schemas are generally intended for security purposes, being able to quickly glance at a list of tables, views, and stored procedures and know what they're there for is great.  And since SSMS lists objects alphabetically by schema, the tables and such end up grouped by schema.  Pretty handy!

New Blog Up and Running

For those of you who know me (and I realize that's a much smaller audience than I'd like to think sometimes), I'm finally getting around to doing the blog thing.  A while back, I attempted to start up a blog while with another consulting firm (which has all but vanished, I think).  But I didn't really do much with that blog, so I'm starting this new one with high hopes!

I've been working in the Microsoft BI space for quite a while now (having started with the alpha bits of SQL Server 7.0 -- trying valiantly to figure out the programming model for DTS before BOL existed -- on a project in Atlanta).  Since that time, I've come to have a pretty decent understanding of Microsoft's BI stack -- including just enough to be dangerous with MOSS.  But my focus continues to be the core components of SSIS and SSAS.  So, more than likely, that's what you'll see the most of in this blog.

Over the course of the next few weeks (and months), I hope to get a number of things posted with regards to the current versions of the tools (aka, SQL Server 2005).  As I get more opportunities to start using the new release, I'll start to post about that.  Hopefully, you'll find a good nugget or three to take away and use on your next BI implementation...