XPath post-1.0 got ridiculous, as many things do. What started as a simple, elegant language morphed into one with an HTTP client, filesystem methods, JSON support, functions, loops, extensions and the ability to read environment variables.
I wrote a post about it a while back[1] (I regret some of the wording used there) and maintain a tool[2] that can exploit XPath injection issues. I'd recommend sticking with 1 or _maybe_ 2, and pretending 3.x doesn't exist.
1. https://tomforb.es/xcat-1.0-released-or-xpath-injection-issu...
2.
I largely agree. XPath 2.0 started the downwards trajectory and XPath 3 made it worse.
The one thing XPath 2.0 and later do improve over XPath 1.0 is the "standard library": most of EXSLT got standardised in 2.0, and useful new functions were added in later revisions (e.g. contains-token from 3.1 is XPath finally adding the ~= operator from CSS).
Here's the deal though: it should be possible to add most functions without updating the rest of the engine (indeed the majority were originally developed for 1.0). I think some of the functions are designed to work with and around types, which would not be useful in 1.0.
There are other useful things besides functions
Sequences, for example. In XPath 1 a query returns a node-set, so the output is always in document order. When the document reorders things, the query output changes, and you can never recover the original order. With sequences, a query can return nodes in any order.
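The document-order behaviour is easy to see concretely; here's a minimal sketch using lxml (which wraps libxml2's XPath 1.0 engine), with a made-up two-element document:

```python
from lxml import etree

q = '//name/text()'
d1 = etree.XML('<r><name>alice</name><name>bob</name></r>')
d2 = etree.XML('<r><name>bob</name><name>alice</name></r>')

# An XPath 1.0 node-set follows document order, so reordering the
# document reorders the result -- the query itself cannot impose an order:
print(d1.xpath(q))  # → ['alice', 'bob']
print(d2.xpath(q))  # → ['bob', 'alice']
```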
XPath/XSLT 2+ also have only a single implementation (by the spec author) so don't meet W3C's requirement of two interoperating implementations. Basically, XSLT ceased to be a "standard" whereas XSLT 1.0 had excellent portability across libxslt, Saxon, Xalan, and MS' xslt.exe.
Edit: there is/was a token implementation for XSLT 2.0 called Gestalt
It's also worth noting that the specification's author built his company on this single implementation.
Saxon supports all xpath versions though? It also bundles some very dangerous functions, some of which xcat can take advantage of.
XPath has always been extensible, at least at the implementation level. E.g. in lxml it's trivial to add XPath functions in Python. Homegrown, of course, but still possible. Together with extension elements, this is about the only way to hook XSLT into the rest of the system. How else is one supposed to read environment variables from XSLT? The only other way is to pass everything in via the command line as parameters.
It's insecure to run untrusted XPath, but isn't that true of untrusted anything? A good solution here would be a way to sandbox such XPath, i.e. to limit which functions can be called, the same way it's done with XML, where you can forbid the processor to use the network or access arbitrary files on a case-by-case basis.
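For reference, the lxml route looks roughly like this; the function name and document are my own invented example, not part of any standard:

```python
from lxml import etree

# Register a homegrown function in the default (null) XPath function
# namespace; 'lowercase' is an invented name, not from any spec.
ns = etree.FunctionNamespace(None)

def lowercase(context, nodes):
    # Extension functions receive an evaluation context plus the
    # evaluated arguments; node-set arguments arrive as lists of elements.
    return nodes[0].text.lower()

ns['lowercase'] = lowercase

doc = etree.XML('<root><item>FOO</item></root>')
print(doc.xpath('lowercase(//item)'))  # → foo
```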
> How else one is supposed to read environment variables from XSLT?
Setting aside whether it’s even a good idea to allow XSLT to do that, XPath is only a subset of XSLT, so you’re just changing the subject. The “path” in XPath should be a hint at what it’s supposed to be: a query language to select nodes by path in XML documents. As opposed to an alternative of Awk, or Perl.
XPath 3 was conceived with support for XSLT and XQuery in mind - where reading environment variables and text files are most definitely very useful features. This is indeed not something you want in a browser, but that ship had already sailed by then.
I've read your article... Holy shit. They took a simple, sed-like tool and turned it into an abomination.
It ain't done before it can receive e-mail.
_Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can._
I was going to suggest Vim as a counterexample, but sure enough
https://github.com/soywod/iris.vim
That's a community based plugin, Vim is still focused on text editing and not much else.
Most of the functionality in editors like Vim and Emacs comes from community plugins. People would mostly not use these editors without such extensions.
This baseless assertion is simply wrong. Plugins are nice to have, but the bulk of their use is to customize default installs.
Curious what is the problem with this? You can still use your small sed-like subset of language in your project?
>>> They took a simple, sed-like tool and turned it into an abomination.
> Curious what is the problem with this?
Product Managers.
With the rise of numerous hierarchical document formats (JSON, YAML, TOML, properties files), what XPath REALLY should have evolved into was a more format-flexible path language with format-specific extensions as needed.
> what XPath REALLY should have evolved into was a more format-flexible path language with format-specific extensions as needed.
You can probably already do that just fine: ignore attribute nodes, and e.g.
    {"menu": {
        "id": "file", "value": "File",
        "popup": {"menuitem": [
            {"value": "New",   "onclick": "CreateNewDoc()"},
            {"value": "Open",  "onclick": "OpenDoc()"},
            {"value": "Close", "onclick": "CloseDoc()"}
        ]}
    }}

    /menu/popup/menuitem/*[last()]/preceding-sibling::*[1]/value

selects "Open". Something along those lines.
Maybe relax nodetypes so they can be pluggable per-language, but I'm not sure that's even useful or necessary.
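The mapping sketched above can be mocked up in a few lines. This is a toy illustration using Python's stdlib ElementTree, whose XPath support is only a small subset of 1.0 (no preceding-sibling axis), so the query is approximated with a positional predicate:

```python
import xml.etree.ElementTree as ET

def json_to_etree(tag, value):
    """Toy mapping: dict keys become child elements, lists become
    repeated sibling elements, scalars become element text."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            items = child if isinstance(child, list) else [child]
            for item in items:
                elem.append(json_to_etree(key, item))
    else:
        elem.text = str(value)
    return elem

data = {"id": "file", "value": "File", "popup": {"menuitem": [
    {"value": "New", "onclick": "CreateNewDoc()"},
    {"value": "Open", "onclick": "OpenDoc()"},
    {"value": "Close", "onclick": "CloseDoc()"},
]}}

menu = json_to_etree("menu", data)
print(menu.find("./popup/menuitem[2]/value").text)  # → Open
```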
Damn. XPath went off the rails after v2. Though, to be fair, so did JavaScript, and look where that is today!
A bloated abomination?
An abomination with 500,000 open job offers, and people happy for their page to take 2 minutes to load, because we can do it "async" and download half the internet to display a table.
But does it send email?
Holy crap. What is this atrocity that is XPath 3.0!? What was wrong with sprinkling some XPath 1.0 queries into a Python script?
Anyone who does scraping or automated browser work eventually comes across XPath.
In some ways, XPath is like regex. It's got insane power, but comes with a relatively steep learning curve. Remember reading regex for the first time? What? But unlike regex, the number of people using it is small in comparison.
I avoided XPath until I couldn't anymore. I could do a lot with CSS selectors, but eventually the DOM traversal became difficult to reason about w/ just CSS.
After taking the dive, it's so powerful. Read a single XPath and like regex, you can fully understand what the thing is going after and how it will get there.
There are functions in XPath 2.0 that I would love to have, but Nokogiri for Rails is stuck in 1.0 world with no plan to go to 2.0. Sad, but I'll live.
> In some ways, XPath is like regex. It's got insane power, but comes with a relatively steep learning curve. Remember reading regex for the first time? What? But unlike regex, the number of people using it is small in comparison.
IMO the learning curve of XPath is not that high, though. It has a somewhat alien syntax, but the only thing I remember giving me trouble is _axes_, because most tutorials just go on with the "shortcut" syntax, so the first time you encounter axes everything goes pear-shaped.
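For the unfamiliar: every shortcut is sugar for an explicit axis step, and seeing the XPath 1.0 abbreviated syntax expanded side by side takes most of the mystery out of it:

```
a/b     ≡  child::a/child::b
//a     ≡  /descendant-or-self::node()/child::a
@href   ≡  attribute::href
..      ≡  parent::node()
.       ≡  self::node()
```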
> There are functions in XPath 2.0 that I would love to have, but Nokogiri for Rails is stuck in 1.0 world with no plan to go to 2.0. Sad, but I'll live.
Nokogiri should support function extensions[0] and most of the XPath 2.0 functions were originally extensions to 1.0[1], so even if these functions are not distributed with nokogiri you should be able to add them yourself.
Incidentally, Nokogiri seems to optionally depend on libexslt, which is the exslt implementation in C for libxml2/libxslt, so exslt should be available either as an option or by building it yourself.
[0] https://github.com/sparklemotion/nokogiri/commit/eb56525fbcc...
[1]
Many moons ago I worked somewhere that used XPath extensively.
Definitely a serious learning curve, some of the developers really struggled with it, others went crazy on it.
I made a pivot table maker with it. It was crazy fast vs. the JS version I originally tried back in the pre-V8 engine days. The JS version would basically die once you got past a trivial amount of data; the XSLT one was instant regardless of the amount of data.
I agree. I think the original author completely missed the point and conflates lack of mainstream usage with dead tech. If you never run into problems that xpath addresses, of course you’ll never use xpath. It’s not for everyday use. And certainly shouldn’t be billed as a CSS selector replacement.
I think their complaints about browser support are fair (orthogonal to whether the newer versions are any good, which most of the comments here are talking about!)
In a self-managed environment, like a PC or server, then you're right that popularity makes little difference.
Similar, when I have control of the source code then CSS selectors are fine (I can always throw in another ID or Class Name). When I don't have control of the source code then I might have to use XPath if CSS selectors are insufficient.
If you need to do web scraping, learning XPath is very helpful.
XPath is so powerful for web scraping, I just realized recently. I'd been using CSS selectors for my occasional scraping needs and never bothered to learn XPath, until one day on a whim I decided to learn at least the basics.
Man, I can now write scrapers in 2 minutes that used to take me quite some time, thanks to the power of XPath. Things like ancestors, contains, the ability to chain, etc. are so, so powerful. I used to write so many hacks just to do the same with CSS before.
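As a concrete sketch of the kind of chaining that is clumsy with CSS selectors (the page and class names below are invented, and lxml stands in for whatever scraping library you use):

```python
from lxml import html

page = html.fromstring("""
<html><body>
  <div class="card"><h2>Laptop</h2><span class="price">$999</span></div>
  <div class="card"><h2>Mouse</h2><span class="price">$25</span></div>
</body></html>
""")

# Match on the price text, then walk *up* to the enclosing card and
# back down to its title -- one chained expression, no hand-written loop:
title = page.xpath(
    "//span[contains(@class, 'price') and contains(text(), '$25')]"
    "/ancestor::div[@class='card']/h2/text()"
)
print(title)  # → ['Mouse']
```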
I realized a couple months back that Google sheets supports using xpath to scrape web pages. So now I have a "spreadsheet" scraping a page to see when a model of laptop goes on sale. Seems to work; at least, whenever I go double check that page manually it matches the scraped result.
The only problem with the built-in IMPORTXML() function is that it doesn't execute pages with JavaScript. If you ever run into issues give API Importer a try (where I run a headless browser to execute the JavaScript):
https://gsuite.google.com/marketplace/app/api_importer/52965...
Can you point out the resources you used for learning?
I've written a lot of scrapers and am knee-deep in CSS hacks.
If you know CSS well, I find this useful:
The problem with xpath is that you rarely use it, so you forget how to do certain things. Then you have to go and re-learn when you need it. Rinse and repeat.
Take a look at xidel.
It is the swiss army knife of scraping indeed. I feel like I can do anything with a scraper thanks to XPath.
XPath and XSLT were the first time (despite doing Haskell at university) that I started to really understand functional programming. The first time was working on a tech stack that was basically Microsoft SHAPE queries transformed into HTML. The second was multiple projects customising Google custom search engine results. It was weird realising that these very limited primitives were actually infinitely powerful if you were willing to warp your brain the right way.
That said, I scrape a fair few webpages now and have never once revisited XPath. I suppose people have mostly written off anything that feels too much like XML as enterprisey and deprecated.
Indeed, XSLT was the first pure functional language to achieve some popularity (if not love) among wider circles of software developers. At least, so it was in the 2000s.
There may be a need for a replacement with simpler syntax; I wonder if GraphQL might be used in that role.
Out of curiosity, what tools do you use for scraping? Is there a similar simple tool for defining queries over trees?
fyi: XSLT was designed by James Clark, based on concepts and experience of XSLT's Scheme-based predecessor DSSSL. So there's your alternate syntax :) In a way, DSSSL has yielded to a XML-ish surface syntax much like JavaScript, also conceived as a LISPy language, yielded to a Java-ish (awk-ish, actually) syntax.
I've often compared GraphQL to SOAP-XML with WSDL. It's nearly the same thing, and just about as boilerplate-y.
XSLT is about templating/transforming one XML doc into some other format. And there are simple replacements that largely fill the same role. Mustache, Handlebars, Template Toolkit (which was also the simpler solution back when XSLT was popular).
XPath and XML in general is a great example of "Death by Committee". They tried too hard to be too smart and try to solve everything, and overcomplicated it to death. This is why people largely abandoned it. This is what is happening to C++ and they are steering themselves by committee into a dead end.
I'm starting to pay more attention to technologies that are resistant to this. Maybe I'm just getting old, but I'm beginning to value mature, proven technologies over fads. More important is the difficult skill of being able to spot them.
The trouble is they are as rare as hen's teeth. The temptation to add a little more is overwhelming. I know I suffer from it myself.
Yeah, the committee's decision to avoid ABI breakage is a serious _deathblow_ against the language. Especially when a formal ABI was _never_ defined in the first place. So C++ is stuck with poor implementations of std::regex and std::unordered_map forever, where even interpreted languages can beat it.
What, you think tying namespaces to a web domain that is in no way actually used as one and results in XML that is unreadable in its fully qualified form (or in fact not even valid XML) and changes not just meaning but _value_ as you try to copy paste any part of it, was a bad idea?
The biggest problem with the new XPath versions is that the W3C made the standards, but almost no one implemented them, so you cannot actually use them
I was doing web scraping, and needed regular expressions to get the text, so I have implemented XPath 2. And currently I am updating it to XPath 3.1:
http://www.videlibri.de/xidel.html
Yooo, thanks for Xidel! I use it dozens of times per week. It's amazing. Next to the actual shell probably the single most useful ETL and scraping tool I've ever encountered. Keep at it!
> I was doing web scraping, and needed regular expressions to get the text, so I have implemented XPath 2.
Most XPath implementations have no issue with adding extension functions (in fact many support exslt[0] out of the box), you really do not need to use (let alone implement) XPath 2.0 to use regex functions.
[0] http://exslt.org/regexp/index.html
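For instance, lxml ships the EXSLT regex functions via libxslt, so a plain XPath 1.0 engine can already do regex matching; a toy sketch with an invented document:

```python
from lxml import etree

# The EXSLT regular-expressions namespace, supported out of the box
# by libxml2/libxslt and therefore by lxml:
EXSLT_RE = 'http://exslt.org/regular-expressions'

doc = etree.XML('<prices><p>12.50</p><p>n/a</p><p>7.00</p></prices>')
numeric = doc.xpath(
    "//p[re:test(text(), '[0-9]\\.[0-9]')]",
    namespaces={'re': EXSLT_RE},
)
print([p.text for p in numeric])  # → ['12.50', '7.00']
```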
I did not plan to implement it all, only the parts I needed for the webpages in my city. At first I did not even have backward axes. But people care much more about XPath than they care about my city
I also was doing too much competitive programming back then, where you have to discover and implement a highly complex algorithm in a few hours
If such a complex implementation takes a few hours, I could not imagine implementing anything else taking much longer (especially when the spec already says what needs to be implemented and it does not need to be discovered). A few days at most...
But now I am still working on it 14 years later
I don't think this especially changes the underlying point: anyone using tools which were based on libxml2 or xerces is basically stuck in 1999. Having to find and install custom extensions adds a regular frictional cost which encourages you to just do more work in a full programming language since you know you'll be able to satisfy any requirement that way.
I saw so many developers sour on XML after hitting the "this would be easy if we used XPath 2, but instead it's hard" wall that I wonder if anyone on the relevant standards committees ever thought about how much libxml2 would determine whether their work stayed relevant.
Though I've never used Xidel, I came across it when researching XPath 2/3, and was very impressed that anyone managed to implement these massive, complicated specs all by themselves.
The major OSS XML libs, including libxml2 and Xerces, do not implement what Xidel does, and neither do some proprietary libs like MSXML.
I love the XPath model of declaratively querying and transforming data, which has been highly influential (see jq, JSONPath, GROQ, etc.). Ultimately, it was too closely tied to XML, which was overdesigned and complex, and got sucked into the committee hell that brought us more overdesigned technologies like SOAP and XML Schema.
With increasing power comes the likelihood that people accidentally write queries with non-polynomial runtime. They look fine in testing, but then with real live data start taking seconds to render/re-render. There are probably examples of this already in CSS, but it seems more likely with arbitrarily backtracking XPath expressions.
XPath 1.0 is maybe the single most useful output from the XML universe. Did something like it exist before?
XPath 1.0 was released in the late 90’s. I remember using it in some server-side XML processing code (Java 1.2?) It did the job where the alternative was writing a ton of procedural code to get at a specific node, etc.
XPath 3 and XQuery 3 are powerful and great technologies to query XML if you need that stuff. The problem is that most implementations cover only XPath 1.0, because I guess it is too difficult (i.e. time-consuming and involved) to produce a 2.x or 3.x implementation, let alone one with full W3C XML Schema support. There is also BaseX, a nice native XML database which implements XQuery 3.x. I really dig XML and its technologies. I wish XQuery 3.x was available everywhere.
One of the huge gaps in JSON tooling is there isn't a standard XPath equivalent (there's JSON Pointer, but it's nowhere close to XPath, and JSON Path which isn't standardized) and no XSLT equivalent.
For as painful as XSLT was, at least it was a standard thing that existed.
I remember spending a good two weeks writing an XPath parser in C, and then the client changed their system responses to JSON. My last experience with XPath.
Shameless plug for DefiantJS[1], which gives lovely, fast XPath query capability over JSON data.
1.
I still do a bunch of XML/XSLT work. I use XPath 1.0 basically every day. It's also awesome for web scraping. Overall, it's a great tool that doesn't get a ton of exposure.
Is there something I can read to get up to speed on xpath? Any recommendations for online or printed resources? (Particularly from folks who use it regularly!)
XPath is hard to replace when writing Selenium WebDriver scripts. Thank you for existing, XPath.
XPath is great, and works equally well in lumbering, ceremony-heavy Enterprise Java environments; and in quick bash one-liners.
I use it in a bunch of scraping scripts for Web sites which don't provide RSS feeds. It's really nice for quickly 'exploring' a document to find the needed data; it's simple to update when sites change their layout; and it can be read in from a config file, argument, env var, etc. to keep things generic and flexible.
I thought XPath was pretty terrific for the day. It let you transform XML into a user interface in an entirely declarative way -- not just the appearance of items like CSS but the actual content could be inspected and altered. I built some cool things in XPath before frameworks like Angular took over.
It sounds like you are talking about XSLT, not XPath.
Are you perhaps confusing XPath with XSLT (which uses XPath for selecting elements) here?
Yes, I was! Thanks.
XPath is still a great way to reach into an xml file and grab a value
If you do any type of web scraping, XPath is the way to go. Thanks to my former co-worker Justin for showing me that.
I used xpath last week for something
This is that weird language you use to make WebDAV servers look okay in a browser, right?
I think you're referring to XSL. The heavy lifting is done by the transformation language (XSLT), but XPath is definitely an underlying tool.
I may be wrong, as it's been some time since I worked with them, but I think XPath is both its own standard, and a part of XSL at the same time. A lot of XSLT deals with selecting nodes from the source and it happens with XPath expressions.
W3C standards usually depend on and leverage other standards, so XPath is its own standard, which is used by XSLT (and XQuery, and possibly a few other things).
You can't use XSLT without XPath, but you can use XPath on its own.
It's hard not to read this as satire, because XPath is so inelegant. Not that CSS selectors are a model of elegance, but it gets the job done (most of the time) and is easy enough for rookie devs and designers to pick up.
> XPath is so inelegant. Not that CSS is a model of elegance
XPath is infinitely more elegant than CSS selectors.
It might be less _pretty_, especially with the sort of selectors you'd use in CSS, but elegance could hardly be _less_ of an issue.
I don't completely understand this sentiment. I mean, when confronted with "rookie devs", surely our focus should be on transforming them into not-rookie devs, not on transforming our tooling into a dumbed-down version...
Plus, while 'elegance' is mostly subjective, I see a lot of it in XPath. It's a DSL for describing generic tree traversal; it's concise and declarative, and frees its users from writing imperative, recursive, repetitive and easy-to-mess-up tree traversal code. Just not having to maintain state by hand during traversal is a huge timesaver. Additionally, at least in version 1 plus some early extensions, XPath is much less complex than PCREs and syntactically not much more complex than CSS.
edit: typos
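The contrast is easy to demonstrate; a minimal sketch with lxml and a made-up document:

```python
from lxml import etree

doc = etree.XML('<r><s><item>1</item></s><item>2</item></r>')

# The hand-rolled version: explicit recursion with state threaded through...
def collect(node, out):
    if node.tag == 'item':
        out.append(node.text)
    for child in node:
        collect(child, out)

manual = []
collect(doc, manual)

# ...versus one declarative expression performing the same walk:
print(doc.xpath('//item/text()') == manual)  # → True
```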
CSS and XPath don't share any functionality.
Maybe you had XSLT in mind (which still is different) - or the fact that both CSS and XPath have "selectors". But one is used to get nodes (as a library), the other is a styling language.
And XPath, at least originally, had an extremely elegant path language.