This is my much-overdue follow-up to the presentation I gave to the New York SQL Server User Group. In this post, I am going to provide some additional resources to supplement the presentation, covering different performance tuning techniques that can be used in SSIS.
Thank you, everyone, for attending my webinar on SSIS Performance Tuning. If you missed the presentation, you can watch it here: http://pragmaticworks.com/LearningCenter/FreeTrainingWebinars/PastWebinars.aspx?ResourceId=683
Below are some of the questions I received from the webinar:
Q: Can you give some performance tips in using script tasks too?
A: Yea, don’t use them! Script tasks can be really bad for performance, and many times I have been able to replace someone else’s code with native SSIS components. For example, I had a situation where the client was using some very complicated .NET code to parse out columns based on logic. The package was taking over an hour to process 1 million records. I replaced the script with some conditional splits and derived columns, and it now runs in 3 minutes.
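The actual SSIS expressions depend on your data, but the general shape of the rewrite looks something like this rough Python sketch (the column and prefix values here are hypothetical, not from the client's package):

```python
# Hypothetical illustration: replacing imperative row-by-row parsing
# (what a Script Component does) with the declarative equivalent of a
# Conditional Split plus Derived Columns.

rows = [
    {"raw": "A|123"},
    {"raw": "B|456"},
    {"raw": "A|789"},
]

# Script-component style: custom code runs once per row.
def parse_row(row):
    prefix, value = row["raw"].split("|")
    return {"type": prefix, "value": int(value)}

# Conditional Split style: simple expressions route rows to outputs...
split_a = [r for r in rows if r["raw"].startswith("A|")]  # output 1
split_b = [r for r in rows if r["raw"].startswith("B|")]  # output 2

# ...and Derived Columns add new columns with expression logic.
derived_a = [{**r, "value": int(r["raw"].split("|")[1])} for r in split_a]
```

The point is not the Python itself but the shift: the same routing and column logic expressed as engine-native expressions lets the data flow stream rows instead of calling into custom code for each one.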
Q: I am assuming that the file formats must be the same for all files when using the MultiFlatFile transform, correct?
A: You are absolutely correct. The metadata in each file must match.
Q: PW delivers a “TF Advanced Lookup Cache Transform” component. What are the benefits of using this component over the Cache Transform covered earlier? It seems that the TF components cannot make use of the same result set when the data set is role based.
A: For basic caching of data, I would use the native SSIS Cache Transform. The major advantage of the Task Factory component is that you can do very difficult range lookups with ease, and they will perform at a high level. Please see my blog post on this.
Q: What version of SQL Server is being used?
A: I was using SQL 2012, but everything in the presentation is applicable to 2005 and 2008.
Q: With the MultiFlatFile connection manager, can you specify specific file types?
A: Yes, the wildcard character can be anywhere in the connection string property. So you could use test*.txt to only pull in text files where the file name begins with test.
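The pattern behaves much like ordinary file-name globbing. A quick Python sketch of the same matching rule (the file names here are made up for illustration):

```python
# The MultiFlatFile wildcard matches file names the same way a glob
# pattern does: "test*.txt" picks up only .txt files starting with "test".
import fnmatch

files = ["test1.txt", "test_2023.txt", "other.txt", "test.csv"]
matched = [f for f in files if fnmatch.fnmatch(f, "test*.txt")]
# matched -> ["test1.txt", "test_2023.txt"]
```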
Q: Why would you ever not use table or view (fast load) option in the OLEDB Destination?
A: Well, I personally would almost always use that option. However, with the fast load option, all records are committed in a single batch. So if there is a bad record that causes a failure, you will not know which record caused the error. With the table or view option, each record is committed individually, so you know exactly which record caused the failure.
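The trade-off can be sketched outside SSIS entirely. Here is a small Python/SQLite illustration of the same idea, batch commit versus row-by-row commit (the table and rows are invented for the example):

```python
# Batch commit vs. row-by-row commit: the batch fails as a whole and
# hides the bad row; per-row commits pinpoint it.
import sqlite3

rows = [("a", 1), ("b", 2), ("c", "bad"), ("d", 4)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (name TEXT, n INTEGER CHECK (typeof(n) = 'integer'))"
)

# Fast-load style: one commit for the whole batch. One bad row fails
# everything, and the error does not identify the offending record.
batch_error = None
try:
    with conn:
        conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
except sqlite3.IntegrityError:
    batch_error = "batch failed; offending row unknown"

# Table-or-view style: each row commits individually, so the failure
# points at the exact record.
failed = []
for row in rows:
    try:
        with conn:
            conn.execute("INSERT INTO t VALUES (?, ?)", row)
    except sqlite3.IntegrityError:
        failed.append(row)
```

A common middle ground in real packages is to keep fast load but redirect error rows to a second destination that inserts row by row, so you only pay the per-row cost for the failing batch.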
Q: Is the volume on?
A: Sorry, can you ask that again? I couldn’t hear you…
Why does the Cache Transform in SSIS have a file cache option? When would I use it, especially compared with the default option of storing the results in memory? One downside of the default option is that the cache is cleared as soon as the package completes, and that in-memory cache cannot be shared with other packages.
The file cache option, on the other hand, allows the results to be reused across multiple packages. But hold up, it’s not quite as good as you might think: each package still must read the file and then load its contents into memory!
So why are we discussing this if it’s essentially doing the same thing as the default option? Good point, and well received, but did you read the title?
Imagine that you have a rather large table that you need to access many times during your nightly ETL; this is very common in data warehousing. You can lessen the effect of a poorly performing network by using the file cache option in the Cache Transform: bring the data across the network only once, load it into a file, and from then on reference that file on your local machine, avoiding the trip across the network entirely. As you can imagine, this can be quite advantageous!
Setup is quite simple as well; see below:
- Bring the cache transform into your data flow and connect it to your source.
- Open the cache transform for editing.
- Select NEW for cache connection manager.
- Select “Use file Cache” and then browse to the location where the file will be stored.
- Next, click the Columns tab at the top and specify at least one index column.
- Click OK, map the columns, and click OK again to close the transform.
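The steps above boil down to a simple pattern: pay the network cost once, persist the lookup data locally, and let every later run read the local file. A rough Python sketch of that idea (not SSIS itself; the file name and data are hypothetical):

```python
# Sketch of the file cache pattern: fetch once over the network,
# persist locally, then serve later lookups from the local file.
import os
import pickle
import tempfile

cache_path = os.path.join(tempfile.mkdtemp(), "dim_customer.caw")  # hypothetical name

def fetch_over_network():
    # Stand-in for an expensive query against a remote server.
    return {1: "Alice", 2: "Bob"}

if not os.path.exists(cache_path):
    with open(cache_path, "wb") as f:
        pickle.dump(fetch_over_network(), f)  # the one trip across the network

with open(cache_path, "rb") as f:
    lookup = pickle.load(f)  # local read; note it is still loaded into memory
```

Note the comment on the last line: just as with the real .caw file, the local cache still has to be read into memory by each consumer; the savings are purely in skipping the network round trip.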
Now that the initial setup is complete, I will demonstrate how to reference this file:
- In the next data flow, set up an OLE DB source that outputs to a Lookup transform.
- Inside the Lookup transform, select “Cache connection manager” on the General screen.
- On the Connection screen, select your cache connection manager from the drop-down list.
- And that’s it! Set up the rest of your Lookup transform as you normally would.
Thanks for looking!
There are quite a few scenarios in which cache transforms can improve usability and performance in SSIS. In this post, I am going to quickly outline a scenario where cache connection managers can greatly improve performance when loading a fact table.
During the pre-execution phase of package execution, SSIS “caches” the lookups inside your package into memory. Unfortunately, if you are using OLE DB connections in your Data Flow Task, this caching happens serially (one after the other). Typically this isn’t a big deal, because you may only have one or two large dimension lookups while all the others are very small and very fast. But what if this isn’t the case? What if you have 5, 6, 7, or more dimensions, all with millions of records, that you must cache before the package begins execution? Here is how we can get around this little issue.
- We can have all of our lookups cached in parallel if we use the Cache Transform in a data flow task that runs prior to the data flow that loads the fact table. If you have never used the Cache Transform before, you can find a quick demo here: http://beyondrelational.com/modules/2/blogs/82/posts/13209/msbi-60-ssis-24-data-flow-task-10-cache-transform-task-detailed-information-amp-explanation-with-exa.aspx
- Inside that data flow task, point each source to its own Cache Transform. Each Cache Transform requires a cache connection manager to store the data.
- Below is a screenshot of the connection managers created by the Cache transforms:
- Now that the cache connection managers have been set up, point each of your Lookup transforms to “Cache connection manager” instead of the default OLE DB connection manager.
- Now for the last step: click on the Connection tab and point to the cache connection manager that stores the data for your lookup:
- All other configuration steps are the same as normal. Thanks for looking, and hope this helped.
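The payoff of the steps above is easy to see in miniature. Here is a Python sketch of the serial-versus-parallel caching idea (the dimension names and load times are invented; real gains depend on your sources and network):

```python
# Warming several lookup caches at once instead of one after the other.
# With five dimensions that each take ~0.2s to cache, serial warm-up
# costs ~1.0s while parallel warm-up costs roughly the longest single load.
import time
from concurrent.futures import ThreadPoolExecutor

def load_dimension(name):
    time.sleep(0.2)  # stand-in for caching millions of rows from a source
    return name, {"rows": "cached"}

dims = ["DimCustomer", "DimProduct", "DimDate", "DimStore", "DimPromotion"]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(dims)) as pool:
    caches = dict(pool.map(load_dimension, dims))
parallel_time = time.perf_counter() - start  # roughly one load, not five
```

This is exactly what the separate cache-loading data flow buys you: the expensive dimension reads overlap instead of queuing up in the fact package's pre-execution phase.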