SSIS Performance Tuning the For Each Loop. Retain Same Connection.

With the Retain Same Connection property I was recently able to more than double the performance of my SSIS package for a client. The package was looping over hundreds of thousands of files and logging the file names into a table to be later processed.

Retain Same Connection is a property setting found on connection managers. By default this property is set to false which means that each time the connection manager is used the connection is opened and subsequently closed. However in a situation like mine this can significantly degrade overall performance as the package has to open and close that connection hundreds of thousands of times. In this blog I’m going to set up a very basic example and walk through setting up this connection manager.

Here I have set up two connection managers. Both connection managers point to the same database. It’s important to note that if you are using Project Level Connection managers in SQL 2012 that setting this property inside any one package will persist across all packages. Therefore I create two connection managers.

image

Inside the package I am simply using a for each loop task to loop through a list of files in a directory and then I load the file names into a table using an Execute SQL task. For demo purposes I have two examples in one package.

image

  1. The FELC on the left takes 30 Seconds to run.
  2. The FELC on the right takes 11 Seconds to run.

Let’s now discuss how and where we can set this property.

Right click on your connection manager found in the connection managers pain inside the package and select Properties.

image

Inside the properties window find the property “RetainSameConnection” and set the value to “True”. Now the connection will remain open for the duration of the package.

image

As always thanks for looking!

Advertisements

SSIS Performance Tuning Webinar–Q & A

Thank you everyone for attending my webinar on SSIS Performance Tuning, if you missed that presentation you can watch it here: http://pragmaticworks.com/LearningCenter/FreeTrainingWebinars/PastWebinars.aspx?ResourceId=683

Below are some of the questions that I received from the Webinar:

Q: Can you give some performance tips in using script tasks too?

A: Yea, don’t use them! Script tasks can be really bad for performance and many times I have been able to replace someone else’s code with native SSIS components. For example, I had situation where the client was using some very complicated .net code to parse out columns based on logic. The package was taking over an hour to process 1 million records. I replaced this with some conditional splits and derived columns and it now runs in 3 minutes.

Q:  I am assuming that the file formats must be the same for all files when using the MultiFlatFile transform, correct?

A: You are absolutely correct. The metadata in each file must match.

Q: PW delivers a ‘TF Advanced Lookup Cache Transform” component.  What are the benefits of using this component over the Cache Transform covered earlier?  It seems that the TF components cannot make use of the same result set when the data set is role based.

A: For basic caching of data I would use the native SSIS cache transform. The major advantage you get from the Task Factory component is you can do very difficult range lookups with ease and  they will perform at a high level. Please see my blog post on this.

Q: What version of SQL Server is being used?

A: I was using SQL 2012, but everything in the presentation is applicable to 2005 and 2008.

Q: With the multi flatfile connection manager can you specify specific types?

A:  Yes, the wild card character can be anywhere in the connection string property. So you could do test*.txt to only pull in text files where the file name begins with test.

Q: Why would you ever not use table or view (fast load) option in the OLEDB Destination?

A: Well I personally would always use that option. However, with the fast load option all records are committed for the entire batch. So if there is a record that is bad and causes the failure you will not know which record caused the error. With table or view option each record is committed individually so you know exactly which record caused the failure.

Q: is the volume on?

A: Sorry can you ask that again, I couldn’t hear you….

Thanks again!

Use File Cache and the Cache Transform to help mitigate network issues.

Why does the Cache Transform in SSIS have a file cache option? When would I use this, especially when compared with using the default option which is to store the results in memory? One downside of the default option of storing the results in memory is that the cache is cleared immediately once the package is completed and that memory cache cannot be shared among other packages.

However, the File connection option will allow the results to be reused among multiple packages. But hold up, that’s not quite as good as you might think, each package still must read from the file and then subsequently load that file into memory!

So why are we discussing this if it’s essentially doing the same thing as the default option? Good point, and well received, did you read the title?

Imagine now that you have a rather large table that you need to access many times during your nightly ETL, this is very common in data warehousing situations. You can now lessen the effect of a poor performing network by using the file cache option in the cache transform. You can bring the data across the network only once, load it into a file, and in the future you can reference that file on your local machine and avoid  going across the network. As you can imagine this can be quite advantageous!

Set up is quite simple as well, see below:

    1. Bring the cache transform into your data flow and connect it to your source.
    2. Open the cache transform for editing.
    3. Select NEW for cache connection manager.
    4. image
    5. Select “Use file Cache” and then browse to the location where the file will be stored.
    6. Next click the Column tab at the top and specify at least one index column.
    7. image
    8. Click OK, map the columns, Click OK to close out the transform.

Now that the initial set up is complete I will demonstrate how to reference this file:

    1. In the next data flow I set up an OLE DB source and connection that output to a Lookup transform.
    2. Inside the lookup transform select “Cache Connection Manager” on the general screen.
    3. image
    4. On the connection screen select your cache connection manager from the drop down list.
    5. image
    6. And that’s it!, set up the rest of  your lookup transform like you normally would.

Thanks for looking!

Create efficient Range Lookups with the Task Factory Advanced Lookup Transform

There are three generally accepted methods to perform the tricky range lookup in SSIS. A great blog post that outlines all three of these methods and how they perform can be found here. Per the post the best performing solution is unfortunately the one that is arguably the most difficult to set up. Fortunately for us we can get all the performance of scripting without all the work!

Scenario:

We need to perform a lookup to the segment table based on the Weekday and Time that the transaction took place. For example, assume we had a transaction take place on Tuesday at 15:30:00. This transaction would need to be mapped to the SegmentKey of 3. Screenshot of segment table below:

image

Now that we have identified our business requirement, let’s look at how we can use the Advanced Lookup Transform in Task Factory to solve this tricky problem.

  1. For this example we will need two data flow tasks.
    1. The first data flow task will load our segment table into the TF ADV Lookup Cache Transform.
      1. Inside the first data flow task I bring in an OLEDB Source and select my Segment table.
      2. Next I bring in the TF Advanced Lookup Cache transform and connect the two transforms. (Screenshot 1)

image

      1. Open the TF Cache transform and select “Create New Lookup Cache Transform” (Screenshot 2)

image

      1. Click the Input Parameter tab at the top. We are going to add two parameters, one for day and one for the time of the actual transaction. In the bottom dialog box we are going to define our conditions for when a lookup should occur. For this example we have three criteria.
        1. First the day of the transaction must equal the day in the Segment table.
        2. Second the begin time must be less than or equal to the time of the transaction.
        3. Finally the end time must be greater than or equal to the time of the transaction.

image

      1. Once the criteria have been entered click ok twice to close out of the cache connection manager.
      2. Now in our second data flow we will pull in our transactional data and perform our lookup. For this example I have written a simple select statement to simulate a transaction.
        SELECT 'Tuesday' AS DayOfWeek, '15:30:00' AS TIME
      3. Now that the source has been set up we need to bring in the TF Advanced Lookup Transform and open it for editing.
      4. Select our Cache Connection manager from the drop down and then map our input columns to our input parameters. Screenshot below.

image

      1. Now execute the package!

image

Thanks for looking.

Load thousands of files in SSIS lighting fast! MultiFlatFile Connection Manager.

Recently I was working on a project where we needed to process upward of 12,000 files per day. As most would expect we used the Foreach Loop container in SSIS to loop through all 12,000 files. The results? A whopping 77 minutes to run. WOW! (Incase you were wondering this is WOW that’s AWEFUL!, not wow that’s great)

I assume the excessive time is here is due to the time it takes to dynamically set the flat file connection manager and close/reopen the connection. I don’t know exactly how long this process takes but hypothetically let’s say it takes 1 second for every 3 iterations of the loop.

Now some simple math. 12,000 files / 3 (files per second) / 60 seconds = 67 minutes.

So how did we get around this problem? Well we tried two different design patterns. The first and the one I will be blogging about here is the “MultiFlatfile” connection manager. The second, more complicated to set up, but extremely efficient can be found here: Anthony Martin.

Let’s performance test the MultiFlatFile connection manager!

For this post I created three packages.

  • The first package generates 1000 flat files for testing.
  • The second package iterates through all 1000 files. Runtime (22 Minutes)
  • The third package uses the multiflatfile connection to bring in all the files instantaneously. (13 Seconds)

The Foreach loop got utterly destroyed in this contest. The multiflatfile connection manager was able to process all 1000 files in only 13 seconds!! Let’s look at how to set this up.

  1. Right click in the connection manager window and select New Connection > MultiFlatFile > Click Add
  2. Set up the general screen like your typical Flat File connection, you will need to hardcode an existing file.
  3. Click ok to close the connection manager editor, verify the connection manager is selected and then open up the properties window.
  4. In the properties window, scroll down to Expressions > Click the Ellipsis “…” and select “Connection String” from the drop down menu.
  5. Next click  the ellipsis to the right to open up the Expression editor. This part is critical because this is where we set an expression that will determine which files are extracted from the directory. For this simple example my expression was “C:\\Blogs\\1000_Flat_Files\\Test*” (Don’t miss the asterisk here!). The connection manager will now retrieve all files that start with “Test”. Screenshots for set up below.

image

image

image

image

image

This is an awesome alternative to the For Each Loop and it’s much easier to set up, however it has some obvious drawbacks with how flexible it is. If you discover this pattern does not perform as well as you hoped or is not flexible enough please review the blog post I linked above by Anthony Martin. It takes a little more set up but offers incredible flexibility and is the best performing solution!

As always, thanks for looking!

Why are Asynchronous components so bad in SSIS?

If you have spent anytime with SSIS you have heard that Asynchronous components also known as “Blocking” components are bad for performance and should be avoided where possible. You have also heard some analogy or another that explained how these components “block” the data in the data flow and slow down performance. But I’m here to tell you that it gets much worse.

The data flow sends data down in buffers. Each buffer in the data flow will first utilize memory if available, the memory that is captured for that buffer of records will be released once the data flow is done with that buffer. For example once those records in that buffer have been loaded to the destination the buffer and the corresponding memory is released. This memory can now be used for the next buffer of memory coming down the pipeline in the data flow task. This is where the true evil with asynchronous components come in.

What happens if the buffer is not released because it got caught up in the data flow at some blocking transform?  In a worse case scenario your machine/server becomes starved for memory and then SSIS is forced use disk instead of memory. Too many buffers spooled to disk can exponentially increase your package run time. Let me demonstrate.

For this example I have created two packages. Each package uses a sorting transform, however in different locations.

Package One. This package takes an unknown amount of time to run, I cut it off after approximately 68 minutes. The first screenshot shows the package in it’s current state, the second and third screenshot give some insight into why this is taking so long to run.

Screenshot 1:

image

The following screenshot is obtained from SSMS after I run the package from integration services and then run the following query against the SSISDB.

  • “SELECT * FROM [catalog].[dm_execution_performance_counters] (NULL) “

Screenshot 2:

image

Screenshot 3: (Here we see the execution performance from the SSISDB)

image

Unfortunately there are times when you just have no choice and must use asynchronous components in SSIS. One option to consider, can the blocking transform be placed in a more strategic location to minimize the negative impact to performance? For example. In package one the sorting transform was placed before the lookup, the lookup is only forwarding non matching records to the destination, so why do the matching records need to be sorted? If matching records don’t need to be sorted then we can move the blocking transform to after the lookup and keep everything in memory.

Package Two: (This package completes in 1 minute and 20 seconds)

Screenshot 1:

image

Screenshot 2:

image

Screenshot 3:

image

Thanks for looking!

Use the aggregate transform in SSIS to out perform T-SQL!

I know I got your attention with that title. If you speak to any experienced BI Developer you will always be informed that you should perform all of your aggregations in T-SQL where possible in the source and not using the SSIS components. To this point I agree, well most of the time anyway. There are however times when using SSIS will out perform T-SQL if all the stars align. Let me explain.

Recently I was using a CTE and the row_number function in T-SQL to bring back only distinct customers from a list of transactions. This result set was then used to insert inferred members into the customer dimension, if that customer did not yet exist of course. Once the result set returned the list of distinct customers we performed a lookup against the customer dimension and we loaded the non matching customers into the customer dimension.

Now let me note, the T-SQL query honestly wasn’t that bad, it was only taking around 3 minutes to run. The bigger problem here was that the query was using temp db and this was causing locking issues with a more advanced design pattern (Partition switch) that the SQL Server Engine could not resolve.

So, we turned to good ole SSIS to help us out.

  • First, bring all records from the source exactly as they were, all 30 million transactions.
  • Next, perform the lookup to the customer dimension and redirect non matching rows to aggregate transform.
  • Finally, use the aggregate transform to group the new customers and then load the distinct values into the customer dimension.

image

This works because we filter out all the existing customers before doing the aggregation in SSIS and therefore only small percentage of the records require aggregation. The data flow task to load inferred members runs in approximately 80 seconds, where as the original T-SQL query alone was taking over 3 minutes.

Thanks for looking!