This is my much-overdue follow-up blog to the presentation I gave to the New York SQL Server User Group. In this post I am going to provide some additional resources to supplement the presentation; check out the following blog post on different performance tuning techniques that can be used for SSIS.
This blog post is a follow-up to a question I received when I gave my Advanced TSQL webinar for Pragmatic Works. If you haven’t seen it yet and would like to, you can view the webinar here:
Question: How can we get subtotals in TSQL using the CTE method you mentioned?
In my webinar I showed how to get totals and subtotals in the same result set as the line item detail information (see screenshot below). The method I used involved the OVER clause, and it kept the SQL very clean and easy to read. Unfortunately, this method is not the best-performing option available, because the OVER clause without framing spills to a worktable on disk. (I have pasted the TSQL example with the OVER clause at the bottom of this blog post for comparison and reference purposes.)
Subtotals in SQL with the CTE method:
First of all, can I just preface this by saying I love CTEs (Common Table Expressions)? Let’s jump right in and write some code! For this example I’m going to use the AdventureWorks2012 database, but this should work with all versions of AdventureWorks.
Step 1: Create a CTE with the total information:
WITH Totals AS
(
    SELECT CustomerID,
           SUM(TotalDue) AS Total
    FROM Sales.SalesOrderHeader
    GROUP BY CustomerID
)
SELECT *
FROM Totals
ORDER BY CustomerID;
Step 2: Create a SQL query with the line item detail information.
SELECT CustomerID,
       SalesOrderID,
       OrderDate,
       TotalDue
FROM Sales.SalesOrderHeader;
Step 3: Join them together!
WITH Totals AS
(
    SELECT CustomerID,
           SUM(TotalDue) AS Total
    FROM Sales.SalesOrderHeader
    GROUP BY CustomerID
)
SELECT soh.CustomerID,
       soh.SalesOrderID,
       soh.OrderDate,
       soh.TotalDue,
       t.Total
FROM Sales.SalesOrderHeader soh
JOIN Totals t
    ON t.CustomerID = soh.CustomerID;
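If you also want a grand total across all customers in the same result set, the same pattern extends naturally with a second CTE. This is just a sketch against AdventureWorks2012; the GrandTotal CTE and its column name are mine, not part of the original example:

```sql
WITH Totals AS
(
    SELECT CustomerID,
           SUM(TotalDue) AS Total
    FROM Sales.SalesOrderHeader
    GROUP BY CustomerID
),
GrandTotal AS
(
    SELECT SUM(TotalDue) AS GrandTotal
    FROM Sales.SalesOrderHeader
)
SELECT soh.CustomerID,
       soh.SalesOrderID,
       soh.OrderDate,
       soh.TotalDue,
       t.Total,          -- per-customer subtotal
       gt.GrandTotal     -- total across all customers
FROM Sales.SalesOrderHeader soh
JOIN Totals t
    ON t.CustomerID = soh.CustomerID
CROSS JOIN GrandTotal gt; -- one-row CTE, so CROSS JOIN is safe
```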
As I mentioned above, you can get the same results using the OVER clause in TSQL. I have pasted the code for that example below:
SELECT CustomerID,
       SalesOrderID,
       OrderDate,
       TotalDue,
       SUM(TotalDue) OVER (PARTITION BY CustomerID) AS CustomerTotal
FROM Sales.SalesOrderHeader;
Final Thoughts: The CTE method will generally perform better than the OVER clause method, but it takes more code and work. If the OVER clause gets the job done and performance is not an issue, I would recommend using that method to keep the code simpler and easier to read!
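If you want to verify the performance difference on your own system, you can run both versions with I/O and time statistics turned on and compare the output in the Messages tab; worktable reads from the OVER clause version will show up there:

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Run the OVER clause version (and then the CTE version) and compare
-- logical reads; look in particular for the Worktable entry.
SELECT CustomerID,
       SalesOrderID,
       OrderDate,
       TotalDue,
       SUM(TotalDue) OVER (PARTITION BY CustomerID) AS CustomerTotal
FROM Sales.SalesOrderHeader;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```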
Thanks for looking!
Quite some time back I found myself fighting with an execution plan generated by SQL Server for one of my stored procedures. The execution plan always returned an estimated row count of 1 when processing for the current day. I won’t go into details on why this one specific stored procedure didn’t use the older cached plans as expected. I will, however, tell you that, like most things with SQL Server, there is more than one way to solve the problem.
This method is something I have personally wanted to blog about because it’s something I have only used a handful of times, when I just couldn’t get the execution plan to work the way I wanted it to. Note that with this hint we are forcing the SQL Server optimizer to use the statistics for the specific variable value that we provide. However, if the table were to grow significantly in the future, we could be hurting performance by forcing a bad execution plan; that is the drawback to using this hint, so now you know!
Take a look at the two screenshots below. The first is the estimated rows from the Fact Internet Sales table and the second is the estimated execution plan.
What I actually want to see in this execution plan is a HASH MATCH, which will perform significantly better for the number of records I will have. Unfortunately, due to out-of-date statistics, I’m getting a bad plan.
So let’s note two things.
- First, in most situations the best solution here is to simply update statistics. This should be part of ANY database maintenance plan.
- Second, the example I am using here is not great. I am simply forcing the plans to do what I want for demo purposes.
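For reference, updating the statistics from the first point is a one-liner; the table name below is the one from my demo, so substitute your own:

```sql
-- Refresh all statistics on the fact table. FULLSCAN is optional,
-- but it gives the optimizer the most accurate picture.
UPDATE STATISTICS dbo.FactInternetSales_Backup WITH FULLSCAN;

-- Or, update statistics database-wide as part of regular maintenance.
EXEC sp_updatestats;
```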
Let’s take a look at the original query:
DECLARE @ShipDate DATE = '1/1/2008';

SELECT [EnglishProductName] AS Product,
       [SalesOrderNumber],
       [OrderDate],
       [DueDate],
       [ShipDate]
FROM [dbo].[FactInternetSales_Backup] FIS
JOIN [dbo].[DimProduct] DP
    ON DP.ProductKey = FIS.ProductKey
WHERE ShipDate > @ShipDate;
Now we are going to quickly modify this query to use the OPTIMIZE FOR hint. This hint allows us to shape the execution plan SQL Server generates by telling the optimizer which parameter value to plan for. In my case this is a previous date where I know the statistics are reflective of what I want to see in my execution plan.
Here is the modified query:
DECLARE @ShipDate DATE = '1/1/2008';

SELECT [EnglishProductName] AS Product,
       [SalesOrderNumber],
       [OrderDate],
       [DueDate],
       [ShipDate]
FROM [dbo].[FactInternetSales_Backup] FIS
JOIN [dbo].[DimProduct] DP
    ON DP.ProductKey = FIS.ProductKey
WHERE ShipDate > @ShipDate
OPTION (OPTIMIZE FOR (@ShipDate = '1/1/2005'));
GO
In this query the result set returned will still be for the original value of the variable, “1/1/2008”. However, the SQL Server optimizer is going to generate the plan using the OPTIMIZE FOR hint that we provided (highlighted in yellow).
Now let’s take a look at our new Estimated Execution plan:
This time we are getting a Hash Match which is much more applicable for our table and the number of records that will be queried.
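As a side note, if you don’t have a good literal value to optimize for, there is a related variant of the hint, OPTIMIZE FOR UNKNOWN, which tells the optimizer to build the plan from the average density in the statistics instead of sniffing the parameter. A sketch of the same query with that variant:

```sql
DECLARE @ShipDate DATE = '1/1/2008';

SELECT [EnglishProductName] AS Product,
       [SalesOrderNumber],
       [OrderDate],
       [DueDate],
       [ShipDate]
FROM [dbo].[FactInternetSales_Backup] FIS
JOIN [dbo].[DimProduct] DP
    ON DP.ProductKey = FIS.ProductKey
WHERE ShipDate > @ShipDate
-- Plan from average density rather than any specific value
OPTION (OPTIMIZE FOR (@ShipDate UNKNOWN));
```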
As always, Thanks
With the Retain Same Connection property I was recently able to more than double the performance of my SSIS package for a client. The package was looping over hundreds of thousands of files and logging the file names into a table to be later processed.
Retain Same Connection is a property setting found on connection managers. By default this property is set to false which means that each time the connection manager is used the connection is opened and subsequently closed. However in a situation like mine this can significantly degrade overall performance as the package has to open and close that connection hundreds of thousands of times. In this blog I’m going to set up a very basic example and walk through setting up this connection manager.
Here I have set up two connection managers; both point to the same database. It’s important to note that if you are using project-level connection managers in SQL 2012, setting this property inside any one package will persist across all packages. That is why I created two connection managers for this demo.
Inside the package I am simply using a Foreach Loop container to loop through a list of files in a directory and then load each file name into a table using an Execute SQL Task. For demo purposes I have two examples in one package.
- The FELC on the left takes 30 Seconds to run.
- The FELC on the right takes 11 Seconds to run.
Let’s now discuss how and where we can set this property.
Right-click on your connection manager, found in the Connection Managers pane inside the package, and select Properties.
Inside the properties window find the property “RetainSameConnection” and set the value to “True”. Now the connection will remain open for the duration of the package.
As always thanks for looking!
Thank you everyone for attending my webinar on SSIS Performance Tuning, if you missed that presentation you can watch it here: http://pragmaticworks.com/LearningCenter/FreeTrainingWebinars/PastWebinars.aspx?ResourceId=683
Below are some of the questions that I received from the Webinar:
Q: Can you give some performance tips in using script tasks too?
A: Yeah, don’t use them! Script tasks can be really bad for performance, and many times I have been able to replace someone else’s code with native SSIS components. For example, I had a situation where the client was using some very complicated .NET code to parse out columns based on logic. The package was taking over an hour to process 1 million records. I replaced this with some conditional splits and derived columns, and it now runs in 3 minutes.
Q: I am assuming that the file formats must be the same for all files when using the MultiFlatFile transform, correct?
A: You are absolutely correct. The metadata in each file must match.
Q: PW delivers a “TF Advanced Lookup Cache Transform” component. What are the benefits of using this component over the Cache Transform covered earlier? It seems that the TF components cannot make use of the same result set when the data set is role based.
A: For basic caching of data I would use the native SSIS Cache Transform. The major advantage you get from the Task Factory component is that you can do very difficult range lookups with ease, and they will perform at a high level. Please see my blog post on this.
Q: What version of SQL Server is being used?
A: I was using SQL 2012, but everything in the presentation is applicable to 2005 and 2008.
Q: With the MultiFlatFile connection manager, can you specify specific file types?
A: Yes, the wildcard character can be anywhere in the connection string property. So you could use test*.txt to only pull in text files where the file name begins with “test”.
Q: Why would you ever not use table or view (fast load) option in the OLEDB Destination?
A: Well, I personally would almost always use that option. However, with the fast load option, all records are committed for the entire batch, so if there is a bad record that causes a failure you will not know which record caused the error. With the table or view option, each record is committed individually, so you know exactly which record caused the failure.
Q: Is the volume on?
A: Sorry can you ask that again, I couldn’t hear you….