How to Hack an XPath through the Occasional ETL Jungle
The SSIS model of drag-and-drop ETL development is great for building out processes, but when it comes to analyzing those processes, the GUI can quickly become a double-edged sword.
A fork in the complexity road:
On the one-hand,
- Execution flow is obvious when it’s visual.
- Container shapes hint at sequencing and parallelism (or a need for either).
- At a high-level, it’s usually easier to reason about processes at a visual level than the code level. (Flowcharts, anyone?)
- And of course, look how simple SSIS makes your life:
On the other hand, Look how bewildering SSIS makes your life:
I’m not here to judge. To be fair, it could be much more complex than this. However, what happens when you need to find something?
Don’t worry; you’ve got a compass:
For simple searches, the SSIS find dialog can be enough. It even has options, like regex, so you can not only find foobar, you can find all the things that start with foobar:
But sometimes you need a GPS-enabled machete:
For anything other than the simplest cases, the SSIS Find feature tends to fall short. It may be perfect for finding all the foobars, but your search needs in SSIS can easily become far more complex.
For example, consider the humble Execute Process Task:
For the uninitiated, the SSIS Execute Process Task allows you to invoke an executable, with arguments passed as needed. I think of it as a last resort before creating C# Script tasks. As with many of SSIS’ more interesting tasks, Execute Process has several properties to work with:
The most crucial properties here are of course “Executable” and “Arguments,” where some of the most common executables would be cmd or powershell.
Now suppose the bewildering SSIS package shown at the beginning of this post is chock full of Execute Process tasks. (It is.):
And suppose I want to find all the Execute Process tasks that have cmd.exe as their executable and fizzbuzz.exe somewhere in their arguments:
And just to make things extra interesting, suppose I only want to get these tasks if they aren’t commented out.
To review, I want to get Execute Process tasks:
- With cmd.exe as the executable
- With fizzbuzz.exe among the arguments
- Not commented/not a child-of-commented
Suddenly the SSIS Find dialog is looking a bit wimpy.
For argument’s sake, let’s take it as a given that if you have this kind of search problem, you at least have some familiarity with the XML data format. If that’s not the case, a quick tour of the subject on W3Schools should suffice for an introduction.
It may be tempting to think that you could find arguments from your cmd-executing, fizzbuzz-arg-passing, uncommented execute process tasks just by switching to code view (courtesy of F7):
Technically, you certainly can search this XML manually.
Technically, you could actually find all the SSIS info you’re looking for without even leaving the default Designer view of SSIS, but that approach is:
- Prone to human errors in reading, copying, pasting, etc.
XPath, short for XML Path Language, is a language made for searching and extracting data that is in the XML format.
There are numerous tools for using XPath. In fact, TSQL and PowerShell both support it. You can even use XPath to track down elements of a webpage in most browsers, via the developer console. For demo purposes, though, let’s consider a utility that’s simple, cross-platform and tailor-made for XPath: Xidel.
Xidel is an open-source (GPLv3) utility for downloading and extracting data from XML, HTML and JSON, but we’ll just scratch the surface of the XML use-case, here.
Hacking your first XPath:
To get started with Xidel, perform the following steps:
- Download Xidel: http://www.videlibri.de/xidel.html#downloads
- Extract the downloaded archive to a location of your choice.
- Once extraction has completed, this location should contain the executable “xidel.exe”.
- In a command-prompt, navigate to the location from step 2.
You should now be at a prompt something like this:
Now for XPath itself.
XPath is a pretty small language, relatively speaking, but giving a good overview is beyond the scope of a blog post, so let’s focus on some of the main features:
- Path expressions
We’ll also be using the “contains” function and the “ancestor” axis.
This may sound like a lot, but it’s really not that much. Let’s start with Path Expressions.
Path expressions are the basic building block of XPath. A path expression tells XPath which XML nodes you want to select, and is built of some combination of the following:
For starters, take a look at one of the Execute Process tasks we’re trying to select:
Here we have an XML element with the nodename “ExecuteProcessData.” Since we don’t particularly care where in the SSIS package these tasks occur, we can use // to select all the execute process nodes regardless of where they occur. The combination of // and “ExecuteProcessData” gives us the following Path Expression:
This on its own is a valid XPath expression. In fact, if we include the path to the SSIS package we’d like to search, we can pass this expression to xidel with the –extract option, like so:
>xidel.exe --xml [pathToSsisPkg] --extract “//ExecuteProcessData”
This would return all the Execute Process tasks in the specified package. Of course, we have a few more conditions we want to specify.
Predicates and Functions:
To get only the Execute Process tasks with “cmd.exe” we’ll need to add predicates and functions to our XPath toolbelt.
An XPath predicate follows a path expression and is wrapped in square brackets. The value in the brackets tells XPath how to find a particular XML node. For example, look at the following XPath:
If passed to xidel or another XPath tool, this expression will return only the arguments from first Execute Process task in the SSIS package we’re searching.
A predicate can also contain path expressions. We can take advantage of this to pass @, the attribute symbol, to tell XPath we want to look at the Executable attribute:
However, this doesn’t say anything about what we want to search in the Executable attribute. To add that information, we need one more part – the contains() function:
As with most similar string functions, the XPath contains() function takes a target as its first argument, and a search expression as its second argument. This means the above will return all the Execute Process tasks that have ‘cmd.exe’ somewhere in their Executable attribute. Our XPath is almost hacked, but not quite. Next we need:
Operators and Axes:
The typical comparison and Boolean operators are also available in XPath, including:
If we use the and operator, the same XPath we used above can be modified to satisfy our second search condition (Arguments contain Fizzbuzz.exe):
//ExecuteProcessData[contains(@Executable, ‘cmd.exe’) and contains(@Arguments, ‘Fizzbuzz.exe’)]
Only one more XPath feature stands between us and our final search: Axes.
An XPath axis is a keyword that specifies a node/nodes relative to the current node:
To illustrate the purpose of an axis, consider the following XPath:
Here the ancestor keyword is used inside a predicate, to return any ExecuteProcessData tasks which have FooAncestorNode among their XML ancestors. The double-colon is used to delimit an axis from the node(s) we are looking for along that axis.
Putting it all Together
Returning to the XPath that we’re actually trying to build, it’s easy to satisfy our last search condition (that the Execute Process tasks should not be commented out, or children of nodes which are commented out). To meet these last criteria, we can combine the ancestor axis with the not operator:
//ExecuteProcessData[contains(@Executable, ‘cmd.exe’) and contains(@Arguments, ‘Fizzbuzz.exe’) and not(ancestor::DTS:Executable[@DTS:Disabled])]
Finally, this gives us the full XPath for our specialized search case.
Note: The two instances of “DTS:”, which may look unfamiliar, are merely including the SSIS xml namespace in the nodenames, because that’s how those nodes appear in the package itself. Xidel can handle these without importing the SSIS xml schema, so don’t be afraid to use the DTS namespace if it appears among the SSIS nodes you’re searching.
Once you’re satisfied with the XPath you’ve written, you only need pass it to an XPath utility, such as Xidel, to hack your way out of your particular SSIS jungle:
>xidel.exe --xml [pathToSsisPkg] --extract “//ExecuteProcessData[contains(@Executable, ‘cmd.exe’) and contains(@Arguments, ‘Fizzbuzz.exe’) and not(ancestor::DTS:Executable[@DTS:Disabled])]”
This one-liner will return all of the Execute Process tasks we’ve been working to find (if there are any that meet the criteria, of course).
Other Jungles, Same Machete
Xidel is just one tool for using XPath. It lends itself well to demonstration since it is easily installed, has no dependencies, and assumes no prior knowledge – other than a working understanding of XML, of course.
Once you’ve grasped XPath, you’ll see that although it may not be a broadly applicable tool, it is indispensable for certain use cases.
In Data Warehousing specifically, it can also be useful to keep in mind that XPath is supported by TSQL and PowerShell:
TSQL XPath tips:
- General: https://docs.microsoft.com/en-us/sql/relational-databases/sqlxml-annotated-xsd-schemas-xpath-queries/introduction-to-using-xpath-queries-sqlxml-4-0?view=sql-server-2017
- XQuery (improved XML parsing): https://docs.microsoft.com/en-us/sql/relational-databases/sqlxml-annotated-xsd-schemas-xpath-queries/introduction-to-using-xpath-queries-sqlxml-4-0?view=sql-server-2017
PowerShell XPath tips:
- General: http://www.powershellmagazine.com/2014/06/24/pstip-using-xpath-in-powershell-part-1/
- Note: While it is arguably more powerful than Xidel for executing XPath, PowerShell is also more picky about its input. When using PowerShell to extract data from an SSIS package, you must include the DTS namespace if you want to get very far!
- SO thread: https://stackoverflow.com/a/2103190
- Describing all of the above PowerShell snippet is beyond the scope of this article, but it should be clear that it opens SSIS packages, gets xml from them, imports Microsoft’s official DTS namespace, and executes some XPath.
Here’s hoping this article clears more jungle than it creates -- Happy XPath Hacking!