I learned XPath a few years ago and was always frustrated with its documentation. A few basic concepts kept tripping me up as I learned it. My hope with this brief article is to make things a little easier for the next person who needs to learn this small but mighty tool.
What is XPath?
XPath stands for XML Path Language.
XPath is designed to point to parts of an XML document. We use it to match patterns against DOM nodes. It is used in XSLT, Selenium, and other areas where DOM navigation is useful.
When looking at the syntax of an XPath query, view the DOM as a file hierarchy that we are navigating, similar to URL paths. It makes more intuitive sense that way. Each parent element is a “folder” that can contain other folders (child elements).
The general syntax also resembles regex and CSS selectors.
XPath query structure
XPath queries are made up of four parts.
- The prefix determines the starting point of the query.
- The axis describes the relationship between the context node and the nodes being selected.
- The step identifies the element we’re referencing; it becomes the context node for whatever follows.
- The predicate makes the step more specific.
Note: The less specific an XPath query is, the more expensive it becomes, performance-wise. As with CSS selectors, there is a balance to strike between specificity, flexibility, and performance.
|Parts of an XPath query|
|Prefix||Step||Axis||Step with predicate|
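To make the four parts concrete, here is a minimal sketch that assembles a query from them and runs it. The sample markup, the variable names, and the query itself are assumptions made up for illustration, and it uses Python's standard-library `xml.etree.ElementTree`, which only supports a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this sketch.
doc = ET.fromstring("<div><ul><li>First</li><li>Second</li></ul></div>")

prefix = ".//"                  # start anywhere below the context node
step = "ul"                     # the element we want to match
axis = "/"                      # move to the direct children of that element
step_with_predicate = "li[1]"   # a step narrowed down to the first <li>

# Gluing the parts together gives the full query string.
query = prefix + step + axis + step_with_predicate
first_item = doc.find(query).text

print(query)       # .//ul/li[1]
print(first_item)  # First
```

Building the string piece by piece is only for demonstration; in practice you would write the query as a single literal.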
Axis selector examples
Axis selectors allow us to “drill down” into the structure we’re processing to access the node we’re looking for.
||Anywhere in the document when used as a prefix (this sets the context to any descendant element)|
||Child relative to the current node|
||Start at the root (this also sets the context to any child element)|
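As a rough illustration of "anywhere" versus "direct child" matching, the sketch below runs both kinds of query over a small, made-up document. It uses Python's standard-library `xml.etree.ElementTree`, which supports only relative paths on an element, so each query starts with `.` (the context node); the markup and variable names are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration.
doc = ET.fromstring(
    "<html><body>"
    "<div id='nav'><a href='/home'>Home</a></div>"
    "<div id='main'><p><a href='/next'>Next</a></p></div>"
    "</body></html>"
)

# ".//a" searches every descendant of the context node (here, the root).
anywhere = [a.text for a in doc.findall(".//a")]

# "./body/div" walks down one child level per step, like folders in a path.
divs = [d.get("id") for d in doc.findall("./body/div")]

# "./body/div/a" only matches <a> elements that are direct children of a div.
direct = [a.text for a in doc.findall("./body/div/a")]

print(anywhere)  # ['Home', 'Next']
print(divs)      # ['nav', 'main']
print(direct)    # ['Home']
```

The "Next" link is missing from the last result because it sits inside a `<p>`, one level deeper than the path describes.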
XPath also allows you to navigate up and down the hierarchy of the DOM, just like with folder navigation.
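Here is a small sketch of moving down and then back up the tree, much like `cd folder` followed by `cd ..`. The menu markup is an assumption invented for the example, and the queries run on Python's standard-library `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical menu markup, invented for illustration.
doc = ET.fromstring(
    "<ul id='menu'>"
    "<li class='home'><a href='/home'>Home</a></li>"
    "<li class='about'><a href='/about'>About</a></li>"
    "</ul>"
)

# Down: from the context node to every <a> below it.
links = [a.get("href") for a in doc.findall(".//a")]

# Up: ".." steps from each matched <a> back to its parent <li>.
parents = [li.get("class") for li in doc.findall(".//a/..")]

# Down and up in one query: find one specific link, then take its parent.
about_item = doc.find(".//a[@href='/about']/..").get("class")

print(links)       # ['/home', '/about']
print(parents)     # ['home', 'about']
print(about_item)  # about
```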
Selectors can be chained and can include some limited logic. They are based on various pattern-matching criteria, similar to regex:
- relationship (child, sibling, preceding, self)
- attributes (id, class name, href)
- order (first, last)
- content (contains string “xyz”)
||Relationship selector, matches a direct child relationship|
||Order selector, selects the second child|
||Contains text, in this case matching a substring|
||Selects the parent of the
||Not selector. This example selects any
||An example of chaining. Here we’re selecting the first
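The selector types above can be tried out with a short sketch like the one below. The markup, ids, and hrefs are hypothetical, and the queries are limited to the XPath subset that Python's standard-library `xml.etree.ElementTree` understands, so functions such as `not()` are left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for illustration.
doc = ET.fromstring(
    "<div>"
    "<ul id='menu'>"
    "<li><a href='/home'>Home</a></li>"
    "<li><a href='/docs'>Docs</a></li>"
    "<li><a href='/about'>About</a></li>"
    "</ul>"
    "<ul id='footer'><li><a href='/legal'>Legal</a></li></ul>"
    "</div>"
)

# Relationship: <li> elements that are direct children of a <ul>.
item_count = len(doc.findall("./ul/li"))

# Order: positions are 1-based, so li[2] is the second <li> of its parent.
second = doc.find("./ul[@id='menu']/li[2]/a").text
last = doc.find("./ul[@id='menu']/li[last()]/a").text

# Attribute: match on an attribute value.
docs_link = doc.find(".//a[@href='/docs']").text

# Chaining: an attribute predicate on one step, an order predicate on the next.
first_footer = doc.find("./ul[@id='footer']/li[1]/a").get("href")

print(item_count)    # 4
print(second, last)  # Docs About
print(docs_link)     # Docs
print(first_footer)  # /legal
```

Note that XPath positions start at 1, not 0, which is an easy thing to trip over when coming from most programming languages.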
A note about contains(): this selector is rather loose and will match any string that contains the string parameter passed to it. This can cause unexpected results. In the example above, any button with the string “Go” in it will be selected, so in this case “Go Home” and “Go to Next Page” would both be selected. Combining the various selectors can produce the results you seek.
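The looseness is easy to see in a small sketch. One caveat: Python's standard-library `xml.etree.ElementTree` does not implement `contains()`, so the substring check below is emulated in plain Python as a stand-in for a query such as `//button[contains(text(), 'Go')]` on a full XPath engine. The button markup is an assumption made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical buttons, invented for illustration.
doc = ET.fromstring(
    "<div>"
    "<button>Go</button>"
    "<button>Go Home</button>"
    "<button>Go to Next Page</button>"
    "<button>Cancel</button>"
    "</div>"
)

# Stand-in for contains(): a plain substring test on each button's text.
loose = [b.text for b in doc.findall(".//button") if "Go" in (b.text or "")]

# Exact text match, which ElementTree supports directly (Python 3.7+).
exact = [b.text for b in doc.findall(".//button[.='Go']")]

print(loose)  # ['Go', 'Go Home', 'Go to Next Page']
print(exact)  # ['Go']
```

When only one button is wanted, an exact match or an extra attribute or position predicate narrows the result down.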
Scrapy documentation https://doc.scrapy.org/en/xpath-tutorial/topics/xpath-tutorial.html