Use the aggregate transform in SSIS to out perform T-SQL!

I know I got your attention with that title. If you speak to any experienced BI Developer you will always be informed that you should perform all of your aggregations in T-SQL where possible in the source and not using the SSIS components. To this point I agree, well most of the time anyway. There are however times when using SSIS will out perform T-SQL if all the stars align. Let me explain.

Recently I was using a CTE and the row_number function in T-SQL to bring back only distinct customers from a list of transactions. This result set was then used to insert inferred members into the customer dimension, if that customer did not yet exist of course. Once the result set returned the list of distinct customers we performed a lookup against the customer dimension and we loaded the non matching customers into the customer dimension.

Now let me note, the T-SQL query honestly wasn’t that bad, it was only taking around 3 minutes to run. The bigger problem here was that the query was using temp db and this was causing locking issues with a more advanced design pattern (Partition switch) that the SQL Server Engine could not resolve.

So, we turned to good ole SSIS to help us out.

  • First, bring all records from the source exactly as they were, all 30 million transactions.
  • Next, perform the lookup to the customer dimension and redirect non matching rows to aggregate transform.
  • Finally, use the aggregate transform to group the new customers and then load the distinct values into the customer dimension.

image

This works because we filter out all the existing customers before doing the aggregation in SSIS and therefore only small percentage of the records require aggregation. The data flow task to load inferred members runs in approximately 80 seconds, where as the original T-SQL query alone was taking over 3 minutes.

Thanks for looking!

Advertisements

Using cache transforms in SSIS to improve Performance (Part 1)

There are quite a few different scenarios for when cache transforms can be used to improve usability and performance inside SSIS. In this post I am going to quickly outline a scenario where cache connection managers can greatly improve performance when loading a fact table.

During the pre-execution phase of package execution SSIS “caches” the lookups inside your package into memory, unfortunately if you are using OLE DB connections in your Data Flow Task this caching happens in a serialized format (one after the other). Typically this isn’t a big deal because you may only have one or two large dimension lookups and all others are very small in size and very fast. But what if this isn’t the case? What if you have 5, 6, 7, or more dimensions all with millions of records that you must cache before the package begins execution? Here is how we can get around this little issue.

image

  • Inside the data flow tasks we will set each source to point to our cache transforms. Each one of the Cache Transforms will require a connection manager to store the data.

image

  • Below is a screenshot of the connection managers created by the Cache transforms:

image

  • Now that the cache connection managers have been set up, you will need to point each of your lookup transforms to the “Cache Connection manager” instead of the default of  OLE DB connection manager.

image

  • Now to the last step. Now click on the connection tab and point to the cache connection manager that stores the information for your lookup:

image

  • All other configuration steps are the same as normal. Thanks for looking, and hope this helped.