Being able to run multiple stored procedures in parallel can provide a major performance boost when loading a data warehouse. In this post, I’ll describe a process which will take full advantage of server resources by running multiple procedures in parallel.
The system I am working with consists of a stored procedure for each table in a staging database. Each procedure truncates and reloads the data for a single table. Even if this process is different from your own, the concept can be applied to any situation.
There are multiple ways to execute items in parallel, but I find SQL Server Integration Services (SSIS) to be the best choice in most cases. This solution requires two SSIS packages. The first package, which I refer to as the shell, is used to create multiple threads. Each of these threads will execute an instance of the second SSIS package, which I refer to as the inner package. The inner package is what calls the actual stored procedures.
The shell package will set up multiple threads, each of which will continue to loop through the set of procedures that need to be run until there is nothing left to run, at which point all of the loops/threads will stop. To easily maintain the entire solution, I have created a specialized schema. This schema maintains the order of my SPs so that I don’t run anything out of order. It also logs SP starts, successful completions, and errors.
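As an illustration, the backing queue table might look something like the sketch below. The schema name `util` comes up later in the post, but the table, column names, and status values here are my own assumptions, not the author's actual DDL:

```sql
-- Minimal sketch of a queue/log table the util schema could be built on.
-- Table and column names are illustrative assumptions.
CREATE TABLE util.ProcedureQueue (
    ProcedureID   INT IDENTITY(1,1) PRIMARY KEY,
    ProcedureName SYSNAME  NOT NULL,  -- fully qualified: db.schema.proc
    RunOrder      INT      NOT NULL,  -- dependency/ordering group
    RunBatchID    INT      NULL,      -- batch currently processing this row
    Status        VARCHAR(20) NOT NULL DEFAULT 'Pending',
                  -- Pending / Running / Succeeded / Failed
    StartTime     DATETIME2 NULL,
    EndTime       DATETIME2 NULL,
    ErrorMessage  NVARCHAR(MAX) NULL
);
```

A single table like this can serve double duty: the Status column drives what is eligible to run, and the Start/End/ErrorMessage columns provide the run log.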
My project is called Dependency Queue Manager. I have two packages, QueueManager.dtsx and RunProcedure.dtsx. The shell package is QueueManager, and the inner package is RunProcedure. For the project, I have created a connection manager, which will be used by both packages. Obviously, you can set yours up differently; for example, you may have a server dedicated to SSIS and store your data on another server, or you may be setting up multiple environments.
The first thing you need to do when creating your shell (QueueManager.dtsx) is set the package properties. Since we don’t want the whole process to fail in the event that one thread fails, set MaximumErrorCount = 0; we will handle errors with Event Handlers instead. Additionally, since we want to run multiple threads, set MaxConcurrentExecutables = -1, which allows SSIS to run as many executables in parallel as there are logical processors, plus two. Feel free to set other properties as needed, although these are the only ones that are absolutely necessary for parallel execution.
After setting your package properties, you need to decide how many threads you will have running simultaneously. For my example, I’ve decided to run 30 at a time. As a result, I have created a set of variables: IsExecuting1, IsExecuting2, IsExecuting3, … IsExecuting30. I have tied each of these variables to a corresponding thread (i.e. IsExecuting1 points to WorkerThread1, etc.) so that I can maintain the status of each thread. Additionally, the ExecutingThreadCount variable will maintain the status of all the threads together, so I know when all the threads have stopped running because all of my SPs have completed. I have set an expression on this variable so that it equals the sum of all the thread status variables.
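The expression on ExecutingThreadCount is simply the sum of the thirty thread status variables (set EvaluateAsExpression = True on the variable; the middle terms are abbreviated here):

```
@[User::IsExecuting1] + @[User::IsExecuting2] + @[User::IsExecuting3]
    + ... + @[User::IsExecuting29] + @[User::IsExecuting30]
```

Because the expression is re-evaluated each time it is read, the loops always see the current combined status of all threads.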
Before getting to my threads, I have to grab the batch ID to which this load pertains. This batch ID will be mapped to a package variable, which I will populate via an Execute SQL Task called “Get New Batch ID,” which calls an SP that returns the appropriate ID. The task utilizes the project connection manager DBConnection, and its result is captured to populate my variable. Make sure to set MaximumErrorCount = 1 on this task, since we don’t want to run the package without a batch.
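A sketch of what such a batch SP could look like; the procedure name `util.GetNewBatchID` and the `util.RunBatch` table are my assumptions for illustration:

```sql
-- Illustrative sketch: create a new batch row and hand back its ID.
CREATE PROCEDURE util.GetNewBatchID
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO util.RunBatch (StartTime) VALUES (SYSDATETIME());
    -- Single-row result set for the Execute SQL Task to consume.
    SELECT CAST(SCOPE_IDENTITY() AS INT) AS RunBatchID;
END;
```

In the Execute SQL Task, setting ResultSet to “Single row” and mapping the result to the package variable is one straightforward way to capture the ID.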
I’ve used a Sequence Container called “Parallel Threads” to hold my multiple threads. You can see the 30 threads inside, each of which is named appropriately (i.e. thread 1 is named WorkerThread1, etc.). I will explain the individual thread settings below.
As an example for all the threads, I will use WorkerThread1. Note that all the threads are configured the same way. The only difference is that any numbers referring to a thread are specific to that thread (i.e. WorkerThread1 and Execute RunProcedure Package 1; WorkerThread2 and Execute RunProcedure Package 2; WorkerThread3 and Execute RunProcedure Package 3; etc.). Also, make sure to set MaximumErrorCount = 0 on all the individual objects for the same reason as above.
I have managed the threads with For Loop Containers. Each For Loop Container has an initial expression that sets the variable for its thread (IsExecuting1, etc.) to 1 and an evaluation expression that tells the loop to continue to run as long as the combined status of all the threads (ExecutingThreadCount) is above 0, meaning at least one thread in the package is still running.
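For WorkerThread1, the relevant For Loop properties described above boil down to:

```
InitExpression:   @[User::IsExecuting1] = 1
EvalExpression:   @[User::ExecutingThreadCount] > 0
AssignExpression: (none)
```

Note the subtlety: a thread that finds nothing to run sets its own variable to 0 but keeps looping as long as any other thread is still working, so it can pick up SPs that become ready once their dependencies finish.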
Each For Loop Container contains an Execute Package Task, which executes the inner package, RunProcedure.dtsx, on every iteration of the loop. Make sure the Execute Package Tasks are named according to their associated threads (“Execute RunProcedure Package n”). Set the properties in the Package tab appropriately, and create the parameter bindings as well. RunProcedure.dtsx utilizes parameters so that it knows which thread is running it and can then set the status of that thread (IsExecuting1, etc.) based on the status of the SP it runs. (Don’t worry if this sounds confusing; it will become clear below.) This relationship between the shell and inner packages is what lets the individual thread statuses and the overall status change appropriately. The parameter bindings are ParentTaskName, which is the system name of the Execute Package Task, and RunBatchID, which is the user variable RunBatchID that we created at the beginning of the project.
As you noticed above, RunProcedure.dtsx utilizes parameters, which the threads in QueueManager.dtsx populate, so we need to create those. The parameters RunBatchID and ParentTaskName hold the values passed in by QueueManager.dtsx, and I created variables to hold these values during execution. In addition, I created two other variables to house the information for the actual SP that the package will run: ProcedureID and ProcedureName, which are populated by one of the tasks, as I will show below.
The first task in the package grabs the SP that is going to run in this thread. It calls an SP from my schema, which returns information for the next SP that should run. I pass in the RunBatchID variable so the SP checks the correct batch, and the SP returns the SP ID and SP name, which we use to populate our two variables ProcedureID and ProcedureName, respectively.
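A sketch of what such a “get next procedure” SP might look like. The locking hints are the important part: with 30 threads dequeuing concurrently, UPDLOCK and READPAST keep two threads from claiming the same row. The object names (`util.GetNextProcedure`, `util.ProcedureQueue`) and columns are my assumptions, and the dependency-ordering logic is elided:

```sql
CREATE PROCEDURE util.GetNextProcedure
    @RunBatchID INT
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @Next TABLE (ProcedureID INT, ProcedureName SYSNAME);

    -- Atomically claim the next pending SP. READPAST skips rows locked by
    -- other threads; UPDLOCK prevents a second thread from claiming the
    -- same row before we mark it Running.
    UPDATE TOP (1) q
       SET q.Status     = 'Running',
           q.RunBatchID = @RunBatchID,
           q.StartTime  = SYSDATETIME()
    OUTPUT inserted.ProcedureID, inserted.ProcedureName INTO @Next
      FROM util.ProcedureQueue AS q WITH (UPDLOCK, READPAST)
     WHERE q.Status = 'Pending';

    -- Return the claimed row, or ProcedureID = 0 when nothing is ready.
    IF EXISTS (SELECT 1 FROM @Next)
        SELECT ProcedureID, ProcedureName FROM @Next;
    ELSE
        SELECT 0 AS ProcedureID, CAST('' AS SYSNAME) AS ProcedureName;
END;
```

The single atomic UPDATE … OUTPUT is the design choice that makes the queue safe under concurrency; a separate SELECT-then-UPDATE would leave a window where two threads grab the same SP.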
The SP will either return the ID of the next SP to be run or an ID of 0. If it’s a real ID, the package will run the associated SP; otherwise, the package ends, as an ID of 0 means nothing is ready to run. In either case, the next step is a Script Task; additionally, if there is an SP to run, an Execute SQL Task calls the actual SP itself.
The condition that states there is an SP to run (the left side):
The condition that states there is no SP to run (the right side):
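Assuming ProcedureID = 0 is the “nothing to run” sentinel described above, the two precedence constraints can be expression-based (Evaluation operation: Expression):

```
Left side  (SP to run):     @[User::ProcedureID] != 0
Right side (no SP to run):  @[User::ProcedureID] == 0
```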
After either condition, we create a Script Task. Both Script Tasks are actually exactly the same, except for a minor difference in the code. Additionally, be sure to add the following variables to the ReadWriteVariables section so that RunProcedure.dtsx works, no matter which WorkerThread from QueueManager.dtsx runs it:
User::ParentTaskName, IsExecuting1, IsExecuting2, IsExecuting3, IsExecuting4, IsExecuting5, IsExecuting6, IsExecuting7, IsExecuting8, IsExecuting9, IsExecuting10, IsExecuting11, IsExecuting12, IsExecuting13, IsExecuting14, IsExecuting15, IsExecuting16, IsExecuting17, IsExecuting18, IsExecuting19, IsExecuting20, IsExecuting21, IsExecuting22, IsExecuting23, IsExecuting24, IsExecuting25, IsExecuting26, IsExecuting27, IsExecuting28, IsExecuting29, IsExecuting30
The scripts grab the value of the ParentTaskName variable and extract the number (i.e. 1, 2, 3, … 30) to determine which thread is running this iteration of RunProcedure.dtsx. The script then sets the corresponding IsExecuting variable (1, 2, 3, … 30) to either 1 or 0, depending on whether there is an SP to run. Setting the variable to 1 or 0 is the only difference between the two scripts.
Script if there is a SP to run (the left side):
Script if there is no SP to run (the right side):
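The logic of both scripts can be sketched in C# (the Script Task language) roughly as follows; the exact parsing of the trailing digits is my assumption, and the only difference between the two scripts is the 1 versus the 0:

```csharp
public void Main()
{
    // e.g. "Execute RunProcedure Package 17" -> "17"
    string taskName = Dts.Variables["User::ParentTaskName"].Value.ToString();
    string threadNumber = System.Text.RegularExpressions.Regex
        .Match(taskName, @"\d+$").Value;

    // Set this thread's status: 1 in the "SP to run" script,
    // 0 in the "no SP to run" script.
    Dts.Variables["User::IsExecuting" + threadNumber].Value = 1;

    Dts.TaskResult = (int)ScriptResults.Success;
}
```

Because the variable name is built dynamically from ParentTaskName, a single pair of scripts serves all 30 worker threads, which is why every IsExecutingN variable must appear in ReadWriteVariables.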
The IsExecuting variable gets passed back to QueueManager.dtsx to determine that particular thread’s status, as well as the overall status. These statuses determine when to stop looping through all the threads: when there are no more SPs to run, every thread will return 0, the overall status will equal 0, and all the loops will stop. This means the ETL process has completed. (Keep in mind that the util schema I am using in the backend keeps track of errors, so I will only be running SPs that are actually ready to run; SPs that fail are not an issue here.)
The final task is an Execute SQL Task. Here is where we actually execute the SP that needs to run in this package. There is no parameter mapping or result set. This time, we have the SP name from the variable ProcedureName, so we create an expression which will run the SP on the server and database in our connection manager. (By the way, the ProcedureName variable will actually hold a fully qualified name (database.schema.procedure), so there won’t be any ambiguity about which SP will run if you have multiple SPs with the same name in different schemas or databases. This is especially helpful if you are using a maintenance or admin database that houses the SP metadata for all your databases on one server.)
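The SqlStatementSource expression on the task is then just:

```
"EXEC " + @[User::ProcedureName]
```

Since the variable is empty at design time, you may also want DelayValidation = True on this task so the package validates cleanly before the name is populated.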
After running the individual SP, it is really helpful to log the status of the SP itself, especially since in our particular configuration, we’ve used the SP statuses to determine which SPs to run and in what order. For successful runs, we have created an Execute SQL Task on the Control Flow with a precedence constraint of success on the “Execute SP” Execute SQL Task. In the event of a failure, we have implemented an OnError Event Handler, which will log any errors that occur. Both of these Execute SQL Tasks call the same SP with several parameters to log the SP status: ProcedureID, RunBatchID, Status, and ErrorMessage. For a successful run, ErrorMessage is left NULL, while the OnError handler passes the SSIS system error message variable.
Success (Control Flow Task):
Failure (OnError Event Handler):
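Both tasks can execute the same statement, with only the Status and ErrorMessage mappings differing. A sketch, assuming the logging SP is named util.LogProcedureStatus (the name and mapping details are illustrative):

```sql
-- SQLStatement for both logging tasks.
EXEC util.LogProcedureStatus ?, ?, ?, ?;

-- Parameter Mapping, in ordinal order:
--   0: User::ProcedureID        (the SP that just ran)
--   1: $Package::RunBatchID     (the current batch)
--   2: a status value           ('Success' or 'Failure')
--   3: NULL on success, or System::ErrorDescription in the OnError handler
```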
You may be wondering whether it is even worth implementing this process. Let me tell you why:
- First, this is a relatively easy project to build. Aside from the time spent copying and renaming each thread (copying is definitely the fastest way to create them all, especially if you alter them in the package XML rather than in the GUI), it really doesn’t take much to do this.
- Second, take a look at the performance gain for our particular load. Other things could probably be done to fully optimize your ETL process, but simply by making our process parallel, we’ve improved by roughly 38%, going from 3:28:16 to 2:08:18. I highly recommend you test this out and see how it helps you.
A Look at Both Designers: