Processing "Web 2.0" using XSLT document() variants? No thanks.

Mark Nottingham has written an intriguing piece "XSLT for the Rest of the Web". It's drummed up some interest, some of which has even leaked into the 4Suite mailing list thanks to the energetic Sylvain Hellegouarch. Mark says:

I’ve raved before about how useful the XSLT document() function is, once you get used to it. However, the stars have to be aligned just so to use it; the Web site can’t use cookies for anything important, and the content you’re interested in has to be available in well-formed XML.

He goes on to present a set of extension functions he's created for libxslt. They are basically smarter document() functions that can do fancy Web things, including HTTP POST, and using HTML Tidy to grab tag soup HTML as XHTML.

As I read through it, I must say my strong impression was "been there, done that, probably never looking back". Certainly no diss of Mark intended there. He's one of the sharper hackers I know. I guess we're just at different points in our thinking of where XSLT fits into the Web-savvy apps toolkit.

First of all, I think the Web has more dragons than you could easily tame with even the mightiest XSLT extension hackery. I think you need a general-purpose programming language to wrangle "Web 2.0" without drowning in tears.

More importantly, if I ever needed XSLT's document() function to process anything more than it's spec'ed to, I would consider that a pretty strong indicator that it's time to rethink part of my application architecture.

You see, I used to be a devotee of XSLT all over the place, and XSLT extensions for just about every limitation of the language. Heck, I wrote a whole framework of such things into 4Suite Repository. I've since reformed. These days I take the pipeline approach to such processing, and I keep XSLT firmly in the narrow niche for which it was designed. I have more on this evolution of thinking in "Lifting XSLT into application domain with extension functions?".

But back to Mark's idea. I actually implemented 4Suite XSLT extensions to use HTTP POST and to tidy tag soup HTML into XHTML, but I wouldn't dream of using these extensions any more. Nowadays, I use Python to gather and prepare data into a model representation that I then hand over to XSLT for pure presentation processing. Complex logical tasks such as accessing Web data beyond trivially fetched XML are matters for the model layer, and not the presentation logic. For example, if I need to tidy something, I tidy it at the Python level and put what I need of the resulting XHTML into the model XML before passing it to XSLT. I use Amara XML Toolkit with John Cowan's TagSoup for my tidying needs. I prefer TagSoup rather than tidy because I find it's faster and more robust.
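
To make the pipeline idea concrete, here is a minimal sketch of the model-building step. It is not the Amara/TagSoup code I actually use: it substitutes the standard library's html.parser so that it runs anywhere, and the names LinkCollector, build_model, and the model/link element names are made up for illustration. It digs the links out of tag soup at the Python level and emits small, well-formed model XML, which is the only thing the XSLT would ever see.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs from possibly malformed HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text chunks seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Tag soup often omits </a>, so a new <a> closes the old one.
            self._flush()
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._flush()

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def _flush(self):
        if self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
        self._href = None
        self._text = []

    def close(self):
        super().close()
        self._flush()  # pick up an anchor left open at end of input


def build_model(tag_soup):
    """Return model XML (hypothetical vocabulary) for the links in tag_soup."""
    parser = LinkCollector()
    parser.feed(tag_soup)
    parser.close()
    root = ET.Element("model")
    for href, text in parser.links:
        link = ET.SubElement(root, "link", href=href)
        link.text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    soup = '<p>Unclosed <A HREF="http://example.org/a">First<br><a href="/b">Second link'
    print(build_model(soup))
```

The string that build_model returns is what gets handed to the XSLT processor in a separate step, so the transform never needs anything fancier than its plain source document.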

Even if you use the libxml2 family of tools, I still think it's better to use libxml, and perhaps the libxml HTML parser to do the model processing and hand over resulting XML to libxslt in a separate step.

XSLT is pretty cool, but these days rather than reproduce all of Python's dozens of Web processing libraries therein, I plump for Python itself.

[Uche Ogbuji]

via Copia
7 responses
Hi Uche,



I know what you're saying, and I had reservations myself. I'm coming at it from perhaps a different angle; I started with xpath2rss, which uses a separate, XML configuration file to drive scraping.



Looking forward, I had other things to scrape that required a little more context (e.g., cookies, and a small smattering of conditional logic), and my first bias was to do it in Python.



However, I found that the benefits of having a separate file that contains the transform on its own are great (esp. when the sites you're scraping change), and the APIs for working with the XML blew out a few lines of simple, intuitive (to me) XSLT to a lot of obfuscated Python.



So, I pretty much see these extensions as useful for scraping other sites. Can they be abused? Absolutely. I draw the line for XSLT at getting the data; once it's in a format that I can read, the application goes over to pure Python.
Hey Uche,



I completely agree.  While obviously your pipeline processing language of choice would be Python and mine C#, the point is exactly the same:  use general-purpose programming languages for general-purpose programming, such as "directing traffic" by piping the XML returned by one process into the input of another or, if you're using C# or Java or other less dynamic, more procedural languages, converting complex types to their composite of simple types, and a composite of simple types into the necessary complex data type, structure, or whatever you want to call it... etc... etc... etc...



Domain Specific Languages are designed to fulfill a very specific purpose.  XSLT was originally designed to transform XML into XML, HTML, or text...  now things are a little more complex, as version 2.0 has made it possible to use XSLT to process text using a set of instruction elements and attributes designed for that exact purpose (or I should say the various purposes that exist for processing text efficiently and correctly).



While I agree that this moves away from the original purpose of processing input XML, it seems to be a direction that, by natural order and selection, the XSLT development community obviously wanted as part of the "package", given the immense amount of time spent building the various hacks. Those hacks did get the job done, but it was obvious they simply needed to become part of the language.  Of course when you commit to a particular direction (in this case text processing) you can either half a$$ it and just put in the minimal amount of extended functionality to make the more common pieces of the text processing puzzle more natural and efficient, or you can commit to the entire shebang and complete the processing circle so that the need to build extended hacks for the next level of evolutionary output "needs" doesn't exist.



In the end it seems that instead of viewing the eXtensible Stylesheet Transformations Language as XML specific, it has instead taken on more of an all-encompassing, transformation-focused definition, whether that be XML or XML-based specs such as XHTML, "tag-soup"-based HTML, or just simple text (which all of these examples also qualify as)... or in other words, by "order of the committee", XSLT is really just a text processor that has specific capabilities, and even more so, is optimized to take as input various textually structured schemas such as XML and variants of XML such as XHTML, and then output whatever text format you might want, with optimizations again for the same structured text output such as XML and variants of XML such as XHTML.



If nothing else, what XSLT 2.0 has really done is add the ability to take as input the same thing it has always been able to output... which, from a recursive standpoint, fits directly into and actually completes the full recursive circle, while still sticking within the true spirit of a DSL...  In this case, the transformation of text from one format to another, no matter what each format might be.



Oh, one thing...  remember your point regarding calling a transformation file a transformation file instead of a stylesheet?  But yet you seem to want to hang on to the notion that XSLT is for styling the output... so then stylesheet would actually be the proper term in this case.  Not trying to be a smart a$$ nor in any way offend. Just want to make sure the point you made in the past (which I completely agree with) doesn't now become irrelevant given the notion stated above.



But, again... I agree with the main point of your post... General purpose languages should not be replaced by hacked DSL's such as XSLT.  Totally, 100% behind you with this... couldn't agree more. :)
Hi guys,



Interestingly, the first time we had that discussion was right here (http://copia.ogbuji.net/blog/2005-06-20/Why_allow_).


Back then, I was in fact afraid that Amara would try to be a replacement for XSLT where I thought XSLT was just the right way to go for one thing: templating (well XML to XHTML).



I even asked:



"""Does XSLT2 answer that issue?"""



Since then, I've learnt what the potential of Amara is and where it stands in the processing chain. I also discovered that XSLT2 was doing too much IMO. Not to say XSLT/XPath2 are bad (I don't pretend I have the skills to say such things), but I feel like they try to encompass features that should definitely stay with your favourite language du jour: Python, C#, whatever.



Of course, I am often interested in things you can now do with XSLT2, but overall, I don't know, it feels too overengineered in a way. I insist, it's a feeling. XSLT 1.0 seems to have found the right balance: a basic but comprehensive set of features, while not being intrusive in what a developer wants to achieve in their design.



That being said, there are fields where it is sometimes easier or nicer to do some tasks directly from XSLT. EXSLT was one answer to such cases. The point I brought on Mark Nottingham's blog was to bring a couple of features via extensions that could spice things up a bit :)



I think now that I was more interested in testing my ideas (thrown without much thinking I must admit) rather than having a definite goal per se.



To sum up, I totally agree with what you stated in your entry and it's always nice to be pushed back on track from time to time.



- Sylvain
Here's a quick and simple theorem that just occurred to me as a possible "Is a DSL doing too much?" test:



A DSL should transact one primary system verb and one primary system verb only.  In all cases the verb used represents a complex transaction type, but it must always be an end-to-end transaction, with no side effects that affect anything outside the scope of the transaction.



A DSL can use other pre-existing verbs internally to create the desired result (e.g. the document function obviously needs to implement the HTTP "get" method when it is passed an HTTP-based URI).  It can also implement as many underlying methods, functions, etc... as it needs to successfully complete its primary verb transaction.



So in the case of XSLT, the system verb would of course be "transform".  In no way should you ever attempt to stray from this primary system verb.  Instead, if a new transaction needs to take place that does not already exist in the base system or as a primary system verb of another available DSL, then a new DSL should be created for this primary system verb.



---



What do you think?  If nothing else it could at least be the start of something else that ultimately attempts to define constraints in which a true DSL must operate.



thoughts?
"Oh, one thing...  remember your point regarding calling a transformation file a transformation file instead of a stylesheet?  But yet you seem to want to hang on to the notion that XSLT is for styling the output... so then stylesheet would actually be the proper term in this case."



Not at all.  I consider the act of converting model XML to presentation XML or HTML to be transformation, not styling.  If the result of transformation is HTML, then I might style that HTML with CSS.  Ditto if the result of transformation is XML.  Don't forget that the "style" aspect of ur-XSL is what became XSL-FO, which does deal with font sizes, spacing, margins, and other stylistic matters.  XSLT has none of that, so it's not a stylesheet language, almost regardless of how it's used.  It's a transformation language.
Re: a test for DSL boundaries, I don't know.  I'm not sure I'd be so precise on such a broad question.  I think each DSL defines its own boundary.  Most of XSLT's original boundaries are well known.  I think that most of XSLT 2.0's best enhancements have actually been aimed at filling originally unappreciated gaps between XSLT 1.0 and its intended boundaries.  So, for example, no one intended it to be so hard to do grouping in XSLT 1.0, so 2.0 adds much needed grouping primitives.  No one intended multi-pass processing to be so hard, so the RTF/node-set distinction was eliminated.  Methinks all of the places where XPath 2.0 and XSLT 2.0 annoy me are places where they escape the original XSLT 1.0 bounds (e.g. in adding data binding).



So I don't know whether I could come up with a short and sweet rule for the propriety of a DSL.  I guess I just know it when I see it (and can objectively justify such observation based on specification or convention).
Hey Uche,



Late reply obviously, but you know why already, so I'll just leave it at that... :)



My bad on the misinterpretation of what you were saying regarding styling.



+1 back to you, -1 on me.. ;)



Regarding DSLs, the more I thought about it the more I realized there's already a way of "confining" a DSL...  they're called Domains...  Yeah. Another -1 for me...



Enjoy your day!