Load thousands of files in SSIS lightning fast! MultiFlatFile Connection Manager.

Recently I was working on a project where we needed to process upward of 12,000 files per day. As most would expect, we used the Foreach Loop container in SSIS to loop through all 12,000 files. The results? A whopping 77 minutes to run. WOW! (In case you were wondering, that's WOW as in "that's awful!", not wow as in "that's great".)

I assume the excessive time here is due to the time it takes to dynamically set the flat file connection manager and close/reopen the connection for every file. I don't know exactly how long this process takes, but hypothetically let's say it takes 1 second for every 3 iterations of the loop.

Now some simple math: 12,000 files ÷ 3 files per second = 4,000 seconds, and 4,000 seconds ÷ 60 = roughly 67 minutes.

So how did we get around this problem? Well, we tried two different design patterns. The first, and the one I will be blogging about here, is the MultiFlatFile connection manager. The second is more complicated to set up but extremely efficient; it can be found here: Anthony Martin.

Let’s performance test the MultiFlatFile connection manager!

For this post I created three packages.

  • The first package generates 1,000 flat files for testing.
  • The second package iterates through all 1,000 files with a Foreach Loop. Runtime: 22 minutes.
  • The third package uses the MultiFlatFile connection manager to bring in all the files at once. Runtime: 13 seconds.

The Foreach Loop got utterly destroyed in this contest. The MultiFlatFile connection manager was able to process all 1,000 files in only 13 seconds! Let's look at how to set this up.

  1. Right-click in the Connection Managers window and select New Connection > MultiFlatFile > click Add.
  2. Set up the General page like a typical Flat File connection; you will need to hardcode an existing file.
  3. Click OK to close the connection manager editor, verify the connection manager is selected, and then open the Properties window.
  4. In the Properties window, scroll down to Expressions, click the ellipsis ("…"), and select ConnectionString from the drop-down menu.
  5. Next, click the ellipsis to the right to open the Expression editor. This part is critical, because this is where we set the expression that determines which files are extracted from the directory. For this simple example my expression was "C:\\Blogs\\1000_Flat_Files\\Test*" (don't miss the asterisk!). The connection manager will now retrieve all files that start with "Test". A few expression variations are sketched below, and screenshots of the setup follow.
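For reference, here are a few variations of the kind of pattern this expression can take (the User::SourceFolder variable in the last line is hypothetical, just to show the expression can be driven dynamically):

    "C:\\Blogs\\1000_Flat_Files\\Test*"       (all files beginning with "Test")
    "C:\\Blogs\\1000_Flat_Files\\*.txt"       (every .txt file in the folder)
    @[User::SourceFolder] + "\\Test*"         (folder supplied by a package variable)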

[Screenshots: MultiFlatFile connection manager setup]

This is an awesome alternative to the Foreach Loop and it's much easier to set up; however, it has some obvious drawbacks in how flexible it is. If you discover this pattern does not perform as well as you hoped, or is not flexible enough, please review the blog post I linked above by Anthony Martin. It takes a little more setup but offers incredible flexibility and is the best performing solution!

As always, thanks for looking!

SSIS Tips and Tricks: the SELECT * counter punch.

Working with SSIS and the Microsoft BI stack I have discovered a few tips and tricks that come in quite handy. To my surprise, a lot of these are not very well known in the BI world. In this series I am going to try to post one or two SSIS tips and tricks on a weekly basis.

As you are probably aware, the SELECT * syntax in SQL Server should not be used inside SSIS. This is because the metadata in SSIS is very rigid: if you add or remove columns in the table behind your SELECT statement, the change can break other transforms in the pipeline that were defined with very specific metadata properties.
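To make that concrete, here is a hypothetical illustration (the LoyaltyTier column is made up):

    -- Someone adds a column to a table that a SELECT * source reads from.
    ALTER TABLE Sales.Customer ADD LoyaltyTier tinyint NULL;
    -- Every data flow built on SELECT * FROM Sales.Customer now returns an
    -- extra column, and downstream transforms validated against the old
    -- metadata fail validation (VS_NEEDSNEWMETADATA) the next time the
    -- package opens or runs.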

So if you are like many people who use SSIS, it's likely you Alt+Tab over to SQL Server Management Studio, view the table, and write your SELECT syntax there with all the applicable columns for your query. Then you copy that syntax and Alt+Tab back over to BIDS/SSDT and paste.

It's unfortunate that there isn't some kind of tool inside SSIS that will assist us with our incredibly difficult predicament. Just kidding, of course there is; if there wasn't, I wouldn't be blogging about it! Let's demonstrate.

First, open up your OLE DB Source component in your data flow and write your SELECT * FROM table statement. For this example I used SELECT * FROM Sales.Customer in the AdventureWorks2012 database.

Next, click "Build Query…" found on the right side of the OLE DB Source Editor.

[Screenshots: the OLE DB Source Editor and the Build Query designer]
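If you are curious, the rewritten query the designer hands back should look roughly like this (column list taken from AdventureWorks2012's Sales.Customer; verify against your own table):

    -- SELECT * expanded into an explicit column list by the query designer
    SELECT CustomerID, PersonID, StoreID, TerritoryID,
           AccountNumber, rowguid, ModifiedDate
    FROM   Sales.Customer;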

Click OK and you are done; the query designer rewrites your statement with the full column list, so the * never makes it into the package. Thanks for looking, and check back for more time-saving tips and tricks.

Why are Asynchronous components so bad in SSIS?

If you have spent any time with SSIS you have heard that asynchronous components, also known as "blocking" components, are bad for performance and should be avoided where possible. You have also heard some analogy or another explaining how these components "block" the data in the data flow and slow down performance. But I'm here to tell you that it gets much worse.

The data flow sends data down in buffers. Each buffer will use memory if it is available, and the memory captured for that buffer of records is released once the data flow is done with the buffer; for example, once the records in that buffer have been loaded to the destination, the buffer and its corresponding memory are released. That memory can now be used for the next buffer of records coming down the pipeline. This is where the true evil of asynchronous components comes in.

What happens if a buffer is not released because it got caught up at some blocking transform in the data flow? In a worst-case scenario your machine/server becomes starved for memory and SSIS is forced to use disk instead. Too many buffers spooled to disk can dramatically increase your package run time. Let me demonstrate.

For this example I have created two packages. Each package uses a Sort transform, but in a different location.

Package One. This package takes an unknown amount of time to run; I cut it off after approximately 68 minutes. The first screenshot shows the package in its current state; the second and third screenshots give some insight into why it is taking so long to run.

Screenshot 1: [image]

The following screenshot is obtained from SSMS after I run the package from the Integration Services catalog and then run the following query against the SSISDB:

  • SELECT * FROM [catalog].[dm_execution_performance_counters] (NULL)
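If you only care about spooling, you can narrow that same function down to the one counter that matters here (a small sketch; passing NULL returns counters for all running executions, or you can pass a specific execution_id):

    -- Check whether any buffers have been spooled to disk
    SELECT counter_name, counter_value
    FROM   [catalog].[dm_execution_performance_counters](NULL)
    WHERE  counter_name = N'Buffers spooled';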

Screenshot 2: [image]

Screenshot 3: (Here we see the execution performance from the SSISDB) [image]

Unfortunately there are times when you have no choice and must use asynchronous components in SSIS. One option to consider: can the blocking transform be placed in a more strategic location to minimize the negative impact on performance? For example, in package one the Sort transform was placed before the lookup, but the lookup only forwards non-matching records to the destination, so why do the matching records need to be sorted at all? If the matching records don't need to be sorted, we can move the blocking transform to after the lookup and keep everything in memory.
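Another option worth mentioning when the data comes from a relational source: skip the Sort transform entirely and push the sort into the source query (a sketch; the table is just an example). You then tell SSIS the data is already sorted by setting IsSorted = True and the SortKeyPosition on the source output in the Advanced Editor.

    -- Let the database engine sort instead of a blocking Sort transform
    SELECT CustomerID, FirstName, LastName
    FROM   dbo.Customer
    ORDER BY CustomerID;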

Package Two: (This package completes in 1 minute and 20 seconds)

Screenshot 1: [image]

Screenshot 2: [image]

Screenshot 3: [image]

Thanks for looking!

Use the aggregate transform in SSIS to outperform T-SQL!

I know I got your attention with that title. If you speak to any experienced BI developer, you will always be informed that you should perform all of your aggregations in T-SQL at the source where possible, not with the SSIS components. On this point I agree, most of the time anyway. There are, however, times when SSIS will outperform T-SQL if all the stars align. Let me explain.

Recently I was using a CTE and the ROW_NUMBER function in T-SQL to bring back only distinct customers from a list of transactions. This result set was then used to insert inferred members into the customer dimension, if the customer did not yet exist of course. Once the result set returned the list of distinct customers, we performed a lookup against the customer dimension and loaded the non-matching customers into the customer dimension.
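The original query looked something like this (a minimal sketch; the staging table and column names are made up for illustration):

    -- Keep one row per customer: the most recent transaction wins
    WITH numbered AS
    (
        SELECT  CustomerAlternateKey, FirstName, LastName,
                ROW_NUMBER() OVER (PARTITION BY CustomerAlternateKey
                                   ORDER BY TransactionDate DESC) AS rn
        FROM    stg.Transactions
    )
    SELECT CustomerAlternateKey, FirstName, LastName
    FROM   numbered
    WHERE  rn = 1;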

Now let me note, the T-SQL query honestly wasn't that bad; it was only taking around 3 minutes to run. The bigger problem was that the query was using tempdb, and this was causing locking issues with a more advanced design pattern (partition switching) that the SQL Server engine could not resolve.

So, we turned to good ole SSIS to help us out.

  • First, bring all records in from the source exactly as they are, all 30 million transactions.
  • Next, perform the lookup against the customer dimension and redirect non-matching rows to an aggregate transform.
  • Finally, use the aggregate transform to group the new customers and load the distinct values into the customer dimension.

[image]

This works because we filter out all the existing customers before doing the aggregation in SSIS, and therefore only a small percentage of the records require aggregation. The data flow task to load inferred members runs in approximately 80 seconds, whereas the original T-SQL query alone was taking over 3 minutes.

Thanks for looking!

Using cache transforms in SSIS to improve performance (Part 1)

There are quite a few different scenarios in which cache transforms can be used to improve usability and performance inside SSIS. In this post I am going to quickly outline a scenario where cache connection managers can greatly improve performance when loading a fact table.

During the pre-execution phase of package execution, SSIS "caches" the lookups inside your package into memory. Unfortunately, if you are using OLE DB connections in your Data Flow Task, this caching happens serially (one after the other). Typically this isn't a big deal because you may only have one or two large dimension lookups, while all the others are very small and very fast. But what if this isn't the case? What if you have 5, 6, 7, or more dimensions, each with millions of records, that must all be cached before the package begins execution? Here is how we can get around this little issue.

[image]

  • Inside the data flow tasks, we set each source to point to a Cache Transform. Each of the Cache Transforms will require a cache connection manager to store the data.

[image]

  • Below is a screenshot of the connection managers created by the Cache transforms:

[image]

  • Now that the cache connection managers have been set up, you will need to point each of your lookup transforms to the cache connection manager instead of the default OLE DB connection manager.

[image]

  • Now for the last step: click on the Connection tab and point to the cache connection manager that stores the information for your lookup:

[image]

  • All other configuration steps are the same as normal. Thanks for looking, and I hope this helped.

How to get the column with the max date or number in SSIS! (The Easy Way)

Recently, when loading a snapshot fact table, I was required to populate one of the columns with the most recent revenue generating activity across all activities. The catch? This information was stored in about 12 different fact tables. I was already storing the last revenue generating activity for each fact table in its own individual column, but now I needed to take the greatest date from all 12 columns and use it to populate the last revenue generating activity across all fact tables. Make sense? Fear not, I have pictures!

To simulate the problem, let's pretend that we run an internet-based company and we have separate fact tables for item purchases, gift card purchases, and services. The first screenshot below shows how the last revenue generating activity can differ for each customer.

[image]

This screenshot is the desired result. Notice how the new column has taken the MAX date from the correct column for each customer!

[image]

So how did we accomplish this goal in less than a minute? Awesome question; hold on, we are almost there.

I decided to take advantage of one of the 194 functions offered in the Task Factory Advanced Derived Column transform: Greatest. Oh, and this works for more than just dates.
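For comparison, if you would rather compute this in the source query, a pure T-SQL sketch of the same idea looks like this (table and column names are made up, and note SQL Server itself only gained a built-in GREATEST function years later, in SQL Server 2022):

    -- Take the greatest of three date columns for each customer
    SELECT  c.CustomerID,
            (SELECT MAX(d)
             FROM (VALUES (c.LastItemPurchase),
                          (c.LastGiftCardPurchase),
                          (c.LastServicePurchase)) AS v(d)) AS LastRevenueActivity
    FROM    dbo.CustomerActivity AS c;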

The two screenshots below demonstrate just how easy it is to set this up. Thanks and enjoy!

[image]

[image]

Outlook Email Tips on Searching Emails

I, like everyone, get an incredible amount of email on a daily basis. I recently learned some awesome tricks for searching my Outlook mail that I wanted to share!

In the past when I wanted to search for something I would just type it into the search box, for example 'John Doe'. Unfortunately this returned not only emails that John Doe sent me, but also ones where he was cc'd or bcc'd. It also appeared that if his name was in the subject, that email would be returned. When you are working on a project in which multiple people are cc'd on every email, this results in an overwhelming number of results.

So now to the good part! Want to search for emails that you received from John Doe? Try this:

From:John Doe (Note you don't have to fully type out his name.)

From:Joh – This will also return emails from John Doe, but of course from anyone else matching "Joh" as well.

To:Jane Doe

Subject:Analytics (This will search the full subject string for the word 'Analytics'.)

These are just a couple of the options available. For a full list please check out the website below:

Learn to narrow your search!