If you want to analyze or create a particular job, Integration Services (SSIS) package, or other form of ETL in Microsoft SQL Server, one of the first things you may want to do is take a look at all the stored procedures you are running and determine which ones need to run before others.
In our case, we have a complex ETL process that calls dozens of stored procedures. We already have a SQL Agent Job for this ETL, but that runs everything in serial. We decided to try to optimize and shorten our load process by running the SP's in parallel. However, before implementing the change, we first had to know which SP's depend on others. This would ensure we are loading everything in the right order and that no data is bad.
Regardless of what kind of process you're going to implement, it can be extremely helpful to devise a list of SP "dependencies," which basically amounts to a parent-child hierarchy. You may say that when architecting a solution/data warehouse, a team should already have this list, but when you have as many SP's as we do, things can quickly get convoluted and confusing, and it really helps to have a master list. Utilizing the sys schema, we can look at the definition of stored procedures in our database and determine how they each relate to one another.
In the first section, we need to create a temporary schema that will hold our metadata so that we can create our mappings. The first table will hold the reformatted stored procedure definitions, which we will scan to get a list of all the tables used in each stored procedure. The second table will contain the list of associated SP's and tables. The third table will hold our parent-child hierarchy. In this table, a "parent" is a SP that has to run before another SP, and a "child" is a SP that depends on another SP. Not surprisingly, a SP can have multiple children (or none), as well as multiple parents (or none).
In the next section, we get a list of all the SP's and their definitions. Note that we are also grabbing the schemas, in the event that there are two SP's with the same name, but different schemas, in the same database. After the initial list, we reformat the definitions by removing all brackets, duplicate spaces, tabs, line-breaks or endings, and commented areas (since the commented parts are obviously inactive, we do not want to look at any of the code in there; if you do, then just take that part out). Doing this ensures that our SQL will be able to easily search the SP definitions and find the tables we are looking for, to create a list of tables and their associated SP's. (If you notice, I am using a table called util.[Procedures]. This is basically a filtered list of the SP's that I want to run in this particular ETL. You can create your own filtered list of SP's, or you can query one of the other sys tables to get a list of all the SP's in your database.)
In the next section, we create a cursor to loop through all the tables in the below schemas in our database. The cursor will look for any instance of the table names in the list of reformatted SP definitions we just created in the last step. It will look for the table names in the format DB.Schema.Table or Schema.Table (if you remember, we removed all the brackets, so this is all we have to search for), and it looks for a space after the table name to differentiate between tables with similar names. Additionally, a table is considered a "source" if it appears after either of the words JOIN or FROM, and a "target" if it appears after any of the following options: INSERT INTO, INSERT, or MERGE. The reason for this distinction is that, for a given SP, a source has to have been loaded, updated, or deleted from prior to it running. Similarly, a target is its result, as that table will have been loaded, updated, or deleted from after the SP runs.
After the cursor, there is an INSERT statement. This code splits records that list a table as both a source and target of a single SP into two records – one for the source and one for the target. This ensures that we take all scenarios into account. Following this is a DELETE statement. This code deletes records for SP's that list either a table as both a source and target (where = 1 and [target] = 1) or as neither a source nor a target (where = 0 and [target] = 0). This gets rid of any redundant records, since we already split the ones where a table is both a source and target, as well as any unnecessary records, where tables are listed as neither.
Finally, we check to see if there are situations in which a table is a target of two SP's. In these cases, one of them has to run first, so we need to run an UPDATE statement to reset its source-target flag so that it shows up as a source. This will ensure the two SP's run in the right order, which will help with locking, as well as obtaining accurate data. In order to determine which one runs first, I have looked for cases in which one SP truncates or deletes from the table, and I've made the assumption that it would perform this action before loading or updating the table. For these cases, it is likely that there are two SP's performing similar actions (i.e. inserts or updates) but that they had to be separated due to the size of data.
In the next section, we perform a recursive JOIN on our list of tables and SP's to create the parent-child hierarchy. Basically, we use the tables and whether their associated SP's are sources or targets. For example, if two SP's reference the same table, and one is a source, and the other is a target, then the source SP has to run before the target (I have changed the naming convention to "parent" and "child" here to keep in line with normal hierarchy conventions). Keep in mind, a table can have multiple/no parents and/or multiple/no children. After the initial list, we load a distinct set of all the parent SP's that do not appear in the child column, and we give them NULL parents. These are the root SP's that don't depend on other SP's and can thus run at the beginning of the load. Finally, we DELETE all records from the hierarchy with circular dependencies. Because we split the records earlier, our JOIN will pick up both records (source and target) and put them together, thus creating records that list a SP as its own child/parent, so we run the DELETE to fix this.
The final section is specific to my own case. I am basically performing the same action as in the last section, but I am grabbing the ID's from my list of SP's in util.[Procedures], so I can load util.ProcedureDependency, which contains the cumulative list of dependencies (basically a parent-child hierarchy). Notice the EXCEPT statement. This ensures that I don't insert any duplicates into util.ProcedureDependency.
This should give you a preliminary list of SP's and their associated hierarchies. You may have to manually tweak the list a bit at the end, if you know that a certain SP depends on another, but the list doesn't reflect that. It would obviously be much better if we could get everything with SQL, but keep in mind this is still a bit of a work in progress. I will continue to provide updates to this post as I fine-tune this process. However, it is definitely a great start to figuring out how various stored procedures in your database interact.