When creating a database and an ETL process to either load or extract from it, you can help yourself quite a bit by creating a metadata repository. The benefit of this metadata is that you can use it to create data-driven ETL and other processes: instead of creating numerous lines of code for each particular object, you can create a dynamic process which uses metadata to perform the appropriate ETL for all your objects.
In particular, you can create a list of the stored procedures in your database (filtered or not) that you need to run; an on-off list which lets you automate processing by setting which ones will and won't run for a given run; a log for more visibility into your processing; and an error log which performs the same function as the log but for errors only. You can even create a dependency list (a parent-child hierarchy) for your stored procedures to determine the order in which they need to run, which helps most when you are multithreading your ETL but still require some sort of order.
For my particular case, I created a schema for the metadata repository, which I named "util" (for utility). I created several tables and stored procedures which have already helped me perform the tasks I mentioned above, and they have already opened doors for us to implement further process optimization down the line if we so desire. Throughout this post I will explain exactly which objects I created, as well as their purposes and benefits. I hope you enjoy, and that this schema helps you with your own processes as much as it has helped me with mine.
The first table is just a list of the stored procedures in your database. You can load everything (all stored procedures in your database) using the sys or INFORMATION_SCHEMA tables/views, or filter the list to your specific needs. The table has an identity column which serves as your ID, with a PK on that column so we can create foreign key references to this table. It also has a flag for whether the stored procedure is active or not. This can help you automate your processing, as you can change the status of the stored procedure here, instead of changing the code in your package or job. One very important thing to keep in mind: make sure you don't include any of the util stored procedures described below in this table, or they will run as part of the job instead of serving their maintenance/metadata purpose.
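A minimal sketch of what this table and its initial load might look like. The schema and table names come from the post; the exact column names and types are my assumptions:

```sql
CREATE SCHEMA [util];
GO

-- List of stored procedures to manage. The post describes an identity PK
-- and an active flag; the remaining details are assumed here.
CREATE TABLE [util].[Procedures]
(
    ProcedureID   INT IDENTITY(1,1) NOT NULL,
    ProcedureName SYSNAME           NOT NULL,
    IsActive      BIT               NOT NULL
        CONSTRAINT DF_Procedures_IsActive DEFAULT (1),
    CONSTRAINT PK_Procedures PRIMARY KEY CLUSTERED (ProcedureID)
);
GO

-- Seed it from the system catalog, excluding the util schema itself
-- so the metadata procedures never end up running as part of the job.
INSERT INTO [util].[Procedures] (ProcedureName, IsActive)
SELECT s.name + N'.' + p.name, 1
FROM sys.procedures AS p
JOIN sys.schemas    AS s ON s.schema_id = p.schema_id
WHERE s.name <> N'util';
```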
The next table will serve as a list for your batches. This will give you more visibility into your processing, as you can divide different loads into different batches. You will be able to look at individual batches to analyze performance. The table has an identity column which will serve as your ID and a PK on the column so we can create Foreign Key references to this table. It also has a default constraint for creation date, which will give you more visibility into when each batch was created/run.
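A sketch of the batch table, assuming the identity PK and creation-date default described above (column names are my own):

```sql
-- One row per processing batch; the default captures when it was created.
CREATE TABLE [util].[RunBatchHistory]
(
    RunBatchID  INT IDENTITY(1,1) NOT NULL,
    CreatedDate DATETIME2(3)      NOT NULL
        CONSTRAINT DF_RunBatchHistory_CreatedDate DEFAULT (SYSDATETIME()),
    CONSTRAINT PK_RunBatchHistory PRIMARY KEY CLUSTERED (RunBatchID)
);
```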
I’ve created a queue table to manage which stored procedures will run in a given batch. The queue table contains a list of all the stored procedures that will run, grouped by batch. There is a FK reference to [util].[Procedures] and [util].[RunBatchHistory] to maintain integrity for an individual processing batch.
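The queue table might look something like this, with the two FK references the post describes (the composite key is an assumption):

```sql
-- Which stored procedures will run, grouped by batch.
CREATE TABLE [util].[ProcedureQueue]
(
    RunBatchID  INT NOT NULL,
    ProcedureID INT NOT NULL,
    CONSTRAINT PK_ProcedureQueue
        PRIMARY KEY CLUSTERED (RunBatchID, ProcedureID),
    CONSTRAINT FK_ProcedureQueue_RunBatchHistory FOREIGN KEY (RunBatchID)
        REFERENCES [util].[RunBatchHistory] (RunBatchID),
    CONSTRAINT FK_ProcedureQueue_Procedures FOREIGN KEY (ProcedureID)
        REFERENCES [util].[Procedures] (ProcedureID)
);
```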
I’ve created a stored procedure to create new batches and load the queue. It will grab a list of all the stored procedures that are listed as active. This helps automate your processing so that only the stored procedures you want to run will run.
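A sketch of such a procedure, which creates the batch record and then queues every active stored procedure for it (the output-parameter design is my assumption):

```sql
CREATE PROCEDURE [util].[CreateNewRunBatch]
    @RunBatchID INT OUTPUT
AS
BEGIN
    SET NOCOUNT ON;

    -- Create the batch record and capture its ID.
    INSERT INTO [util].[RunBatchHistory] DEFAULT VALUES;
    SET @RunBatchID = SCOPE_IDENTITY();

    -- Queue only the stored procedures flagged as active.
    INSERT INTO [util].[ProcedureQueue] (RunBatchID, ProcedureID)
    SELECT @RunBatchID, ProcedureID
    FROM [util].[Procedures]
    WHERE IsActive = 1
      AND ProcedureID <> -1;  -- skip the dummy root-parent row, if present
END;
```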
The next table is your log. It has an identity column which serves as your ID, with a PK on that column so we can create foreign key references to this table. It also has a default constraint for insert time, which will give you more visibility into when each row was added, and two FKs that reference [util].[Procedures] and [util].[RunBatchHistory]. (Note that they both cascade from the main table. This is to prevent any orphaned records.) This ensures you always have visibility into which stored procedure ran and for what batch. For the status column, I like to use a three-number system for the stored procedures (1 = success, 0 = started, -1 = failure). You may prefer to have one record per stored procedure, which you insert upon start and update upon failure or completion, but I think having two lines per stored procedure gives you more visibility into things such as duration. You can also use that to get an idea of how many stored procedures were running at once, if you're looking at parallel optimization and multithreading.
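A sketch of the run log, with the cascading FKs and insert-time default described above (column names assumed):

```sql
CREATE TABLE [util].[ProcedureRunLog]
(
    RunID       INT IDENTITY(1,1) NOT NULL,
    RunBatchID  INT               NOT NULL,
    ProcedureID INT               NOT NULL,
    RunStatus   SMALLINT          NOT NULL,  -- 1 = success, 0 = started, -1 = failure
    InsertTime  DATETIME2(3)      NOT NULL
        CONSTRAINT DF_ProcedureRunLog_InsertTime DEFAULT (SYSDATETIME()),
    CONSTRAINT PK_ProcedureRunLog PRIMARY KEY CLUSTERED (RunID),
    -- Cascading deletes prevent orphaned log records.
    CONSTRAINT FK_ProcedureRunLog_RunBatchHistory FOREIGN KEY (RunBatchID)
        REFERENCES [util].[RunBatchHistory] (RunBatchID) ON DELETE CASCADE,
    CONSTRAINT FK_ProcedureRunLog_Procedures FOREIGN KEY (ProcedureID)
        REFERENCES [util].[Procedures] (ProcedureID) ON DELETE CASCADE
);
```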
The next table is your error log. It has FK references to [util].[Procedures] and [util].[ProcedureRunLog]. This will give you visibility into any errors for a specific stored procedure in a specific batch, as it ties any error messages to a particular RunID in the run log. There is a column for the error messages, which can either be system messages or user-defined, depending on how you create your individual stored procedures.
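A sketch of the error log (column names assumed). Note that only the run-log FK cascades here: cascading from [util].[Procedures] as well would create multiple cascade paths, which SQL Server rejects at table-creation time, and the run-log cascade already cleans these rows up:

```sql
CREATE TABLE [util].[ProcedureErrorLog]
(
    ErrorID      INT IDENTITY(1,1) NOT NULL,
    RunID        INT               NOT NULL,  -- ties the error to a run-log row
    ProcedureID  INT               NOT NULL,
    ErrorMessage NVARCHAR(4000)    NULL,      -- system or user-defined message
    InsertTime   DATETIME2(3)      NOT NULL
        CONSTRAINT DF_ProcedureErrorLog_InsertTime DEFAULT (SYSDATETIME()),
    CONSTRAINT PK_ProcedureErrorLog PRIMARY KEY CLUSTERED (ErrorID),
    CONSTRAINT FK_ProcedureErrorLog_ProcedureRunLog FOREIGN KEY (RunID)
        REFERENCES [util].[ProcedureRunLog] (RunID) ON DELETE CASCADE,
    CONSTRAINT FK_ProcedureErrorLog_Procedures FOREIGN KEY (ProcedureID)
        REFERENCES [util].[Procedures] (ProcedureID)  -- no cascade: avoids multiple cascade paths
);
```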
The only thing we need now is a log stored procedure, which will log the status for a stored procedure once it starts and then completes or fails. As I stated above, I like to use a three-number system for the status, but feel free to use whatever system you like. Make sure that @ProcedureID and @RunBatchID both exist in their respective tables ([util].[Procedures] and [util].[RunBatchHistory]), as they will violate the FK relationships if they don’t. @ErrorMessage can be either system or user-defined, depending on how you call this stored procedure.
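A sketch of the logging procedure under the two-rows-per-procedure scheme: every call inserts a status row, and failures also write to the error log (parameter names are my assumptions):

```sql
CREATE PROCEDURE [util].[LogProcedureStatus]
    @ProcedureID  INT,
    @RunBatchID   INT,
    @Status       SMALLINT,            -- 1 = success, 0 = started, -1 = failure
    @ErrorMessage NVARCHAR(4000) = NULL
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @RunID INT;

    -- One row per status change: a "started" row and a "finished" row.
    INSERT INTO [util].[ProcedureRunLog] (RunBatchID, ProcedureID, RunStatus)
    VALUES (@RunBatchID, @ProcedureID, @Status);

    SET @RunID = SCOPE_IDENTITY();

    -- On failure, tie the error message to the run-log row just written.
    IF @Status = -1
        INSERT INTO [util].[ProcedureErrorLog] (RunID, ProcedureID, ErrorMessage)
        VALUES (@RunID, @ProcedureID, @ErrorMessage);
END;
```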
That’s it for the basic util schema. The final two tables and remaining stored procedures are more specific to my case. I am trying to run my stored procedures in parallel threads, since my team and I have a complex ETL configuration that takes forever if we run in serial. However, I can’t just run any stored procedures in parallel, as that may cause issues, such as locking, or corrupt/inaccurate data, in the event that I load tables whose sources haven’t yet been loaded. As a result, I have created a dependency table.
The dependency table is a simple list of the stored procedures and their parents. There is a FK reference to [util].[Procedures] for both the child and parent columns. Root level stored procedures will have parents of -1 (this requires me to have a record in [util].[Procedures] with -1 ID, the code for which I have included below as well).
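A sketch of the dependency table, plus the -1 "root parent" record the post mentions (inserted with IDENTITY_INSERT, since the ID column is an identity):

```sql
CREATE TABLE [util].[ProcedureDependency]
(
    ProcedureID       INT NOT NULL,  -- child
    ParentProcedureID INT NOT NULL,  -- -1 for root-level stored procedures
    CONSTRAINT PK_ProcedureDependency
        PRIMARY KEY CLUSTERED (ProcedureID, ParentProcedureID),
    CONSTRAINT FK_ProcedureDependency_Child FOREIGN KEY (ProcedureID)
        REFERENCES [util].[Procedures] (ProcedureID),
    CONSTRAINT FK_ProcedureDependency_Parent FOREIGN KEY (ParentProcedureID)
        REFERENCES [util].[Procedures] (ProcedureID)
);
GO

-- Dummy record with ID -1 so root-level parents satisfy the FK.
SET IDENTITY_INSERT [util].[Procedures] ON;
INSERT INTO [util].[Procedures] (ProcedureID, ProcedureName, IsActive)
VALUES (-1, N'(root)', 0);
SET IDENTITY_INSERT [util].[Procedures] OFF;
```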
In order to ensure these tables load correctly, and that the parallel loads work, I’ve created a final stored procedure. This stored procedure uses pretty much our entire util schema to get the job done. Normally (that is, in serial), the queue table could be used just as a list of the stored procedures to run, in order. However, that would require that your stored procedures in [util].[Procedures] are ordered correctly, since there is no logic to order them in [util].[CreateNewRunBatch]. Although that would be time-consuming, it is not impossible, especially in smaller loads, where the user knows the exact order of what is running. However, in our case, we are trying to run in parallel, which makes it a lot more complicated. Additionally, we have a massive ETL process, which requires certain tables to be loaded before others (meaning certain stored procedures run before others). As a result, for this process, [util].[ProcedureQueue] simply serves as a list of what will run. The stored procedure then uses [util].[ProcedureDependency] to determine the order. It ensures that, for a given stored procedure, all of its parents have succeeded before it runs. If the parents either failed or haven’t run, then it will not run. (For our purposes, having a failed parent is the same as having one that has not run, which you will notice in the code for RemainingParentCount.)
The stored procedure then looks for a stored procedure with a RemainingParentCount of 0. This means that the specific stored procedure's parents have all run (or that it is a root-level stored procedure: initially, all the root-level stored procedures will show up with 0, as their "parent", -1 in [util].[Procedures], has already "run"). The stored procedure locks the table so that only the current thread grabs that particular stored procedure and the others cannot. It then logs that the stored procedure is starting to [util].[ProcedureRunLog] (status of 0). Finally, the stored procedure returns the ID and name of the stored procedure, which I then raise up to my SSIS package, where I execute it. (I return both the ID and name so that I can execute the stored procedure from the SSIS package and then log its status using [util].[LogProcedureStatus] from there as well.)
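The post doesn't name this final stored procedure, so the name below is hypothetical, and the locking approach (READPAST/UPDLOCK/ROWLOCK hints so parallel threads skip rows another thread has claimed) is one common way to implement the behavior described; the author's original may differ:

```sql
CREATE PROCEDURE [util].[GetNextQueuedProcedure]  -- hypothetical name
    @RunBatchID INT
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;

    DECLARE @ProcedureID INT, @ProcedureName SYSNAME;

    BEGIN TRANSACTION;

    -- Grab one queued procedure that has not started and whose parents have
    -- all succeeded in this batch. A failed or not-yet-run parent counts as
    -- remaining, so RemainingParentCount must effectively be 0.
    SELECT TOP (1)
        @ProcedureID   = q.ProcedureID,
        @ProcedureName = p.ProcedureName
    FROM [util].[ProcedureQueue] AS q WITH (READPAST, UPDLOCK, ROWLOCK)
    JOIN [util].[Procedures]     AS p ON p.ProcedureID = q.ProcedureID
    WHERE q.RunBatchID = @RunBatchID
      AND NOT EXISTS  -- not already started or finished in this batch
          (SELECT 1 FROM [util].[ProcedureRunLog] AS l
           WHERE l.RunBatchID = @RunBatchID AND l.ProcedureID = q.ProcedureID)
      AND NOT EXISTS  -- no parent without a success row (RemainingParentCount = 0)
          (SELECT 1
           FROM [util].[ProcedureDependency] AS d
           WHERE d.ProcedureID = q.ProcedureID
             AND d.ParentProcedureID <> -1   -- root-level parents count as "run"
             AND NOT EXISTS
                 (SELECT 1 FROM [util].[ProcedureRunLog] AS l
                  WHERE l.RunBatchID  = @RunBatchID
                    AND l.ProcedureID = d.ParentProcedureID
                    AND l.RunStatus   = 1));

    -- Log the start (status 0) inside the same transaction, so once it
    -- commits, other threads see this procedure as claimed.
    IF @ProcedureID IS NOT NULL
        EXEC [util].[LogProcedureStatus]
             @ProcedureID = @ProcedureID,
             @RunBatchID  = @RunBatchID,
             @Status      = 0;

    COMMIT TRANSACTION;

    -- Return both values so the SSIS package can execute the procedure
    -- and log its completion or failure itself.
    SELECT @ProcedureID AS ProcedureID, @ProcedureName AS ProcedureName;
END;
```

If no eligible procedure is found, the result set contains NULLs, which the calling package can treat as "wait and retry" until the queue drains.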