"Tip: Remove sensitive content from your XML samples with XSLT"

"Tip: Remove sensitive content from your XML samples with XSLT"

Do you need to share samples of your XML code, but can't disclose the data? For example, you might need to post a sample of your XML code with a question to get some advice with a problem. In this tip, Uche Ogbuji shows how to use XSLT to remove sensitive content and retain the basic XML structure.

I limited this article to erasing rather than obfuscating sensitive content, which can be done with XSLT 1.0 alone. With EXSLT (or even XSLT 2.0) you can do some degree of obfuscation, which lets you preserve elements of the character data that are important to the problem under discussion. Honestly, though, I prefer to solve this problem with even more flexible tools. As a bonus, the following is a bit of 4Suite/SAX code that uses a SAX filter to obfuscate character data by adding a random shift to the ordinal of each character in the Unicode alphanumeric class. Non-alphanumeric characters are left alone, so if exotic punctuation or symbols are part of the problem you're demonstrating, they come through untouched. It's easy to use the code as a template; usually all you have to change is the obfuscate function or the obfuscate_filter class to fine-tune how it works.

import re
import random
from xml.sax import make_parser, saxutils
from Ft.Xml import CreateInputSource, Sax

# Maximum shift (in either direction) applied to a character's ordinal
RANDOM_AMP = 15
# Matches any Unicode alphanumeric ("word") character
ALPHANUM_PAT = re.compile(r'\w', re.UNICODE)

def obfuscate(old):
    """Return old with each alphanumeric character randomly shifted."""
    def mutate(c):
        return unichr(ord(c.group())+random.randint(-RANDOM_AMP,RANDOM_AMP))
    return ALPHANUM_PAT.subn(mutate, old)[0]

class obfuscate_filter(saxutils.XMLFilterBase):
    # Intercept character data events and forward obfuscated text;
    # all other events (elements, attributes, PIs) pass through untouched
    def characters(self, content):
        saxutils.XMLFilterBase.characters(self, obfuscate(content))
        return

if __name__ == "__main__":
    XML = "http://cvs.4suite.org/viewcvs/*checkout*/Amara/demo/labels1.xml"
    parser = make_parser(['Ft.Xml.Sax'])        # use the 4Suite SAX driver
    filtered_parser = obfuscate_filter(parser)  # wrap it in the obfuscating filter
    handler = Sax.SaxPrinter()                  # re-serialize the filtered events
    filtered_parser.setContentHandler(handler)
    filtered_parser.parse(CreateInputSource(XML))

This code uses recent fixes and capabilities I checked into 4Suite CVS last week. I think all the needed details to understand the code are in the SAX section of the updated 4Suite docs, which John Clark has already posted.
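
For comparison, here's a minimal sketch (mine, not from the article) of the simpler erase-only approach mentioned at the top: an XSLT 1.0 stylesheet that keeps the element and attribute structure but drops all character data. I apply it with lxml purely for convenience; any XSLT 1.0 processor will do, and you'd add a similar template for attribute values if those are sensitive too.

from lxml import etree

ERASE_XSLT = etree.XML("""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Copy elements and attributes through unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Erase all character data -->
  <xsl:template match="text()"/>
</xsl:stylesheet>
""")

def erase_text(xml_bytes):
    """Apply the erasing transform and return the serialized result."""
    transform = etree.XSLT(ERASE_XSLT)
    return str(transform(etree.XML(xml_bytes)))

if __name__ == "__main__":
    sample = b"<doc><name>Jane Doe</name><ssn>123-45-6789</ssn></doc>"
    print(erase_text(sample))  # prints the same structure with all text removed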

[Uche Ogbuji]

via Copia

"XML in Firefox 1.5, Part 2: Basic XML processing"

"XML in Firefox 1.5, Part 2: Basic XML processing"

Subtitle: Do a lot with XML in Firefox, but watch out for some basic limitations

Synopsis: This second article in the series, "XML in Firefox 1.5," focuses on basic XML processing. Firefox supports XML parsing, Cascading Style Sheets (CSS), and XSLT stylesheets. You'll also want to be aware of some limitations. In the first article of this series, "XML in Firefox 1.5, Part 1: Overview of XML features," Uche Ogbuji looked briefly at the different XML-related facilities in Firefox.

I also updated part 1 to reflect the Firefox 1.5 final release.

This article is written at an introductory level. The next articles in the series will be more technically in-depth, as I move from plain old generic XML to fancy stuff such as SVG and E4X.

[Uche Ogbuji]

via Copia

Thinking XML #34: Search engine enhancement using the XML WordNet server system

Updated: Fixed link to "Serving up WordNet as XML"

"Thinking XML: Search engine enhancement using the XML WordNet server system"

Subtitle: Also, use XSLT to create an RDF/XML representation of the WordNet data
Synopsis: In previous installments of this column, Uche Ogbuji introduced the WordNet natural language database, and showed how to represent database nodes as XML and serve this XML through the Web. In this article, he shows how to convert this XML to an RDF representation, and how to use the WordNet XML server to enrich search engine technology.

This is the final part of a mini-series within the column; the previous installments introduced the WordNet natural language database, showed how to represent its database nodes as XML, and served that XML up on the Web.

In this article I write my own flavor of RDF schema for WordNet, a transform for conversion from the XML format presented previously, and a little demo app that shows how you can use WordNet to enhance search with synonym capabilities (and this time it's a much faster approach).
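
To give a flavor of the synonym-enhanced search idea, here's a rough sketch of query expansion. It is not the column's code: it uses NLTK's WordNet corpus reader rather than the WordNet/XML server developed in this series, so treat it as an illustration of the technique only.

from nltk.corpus import wordnet  # requires nltk and its 'wordnet' corpus download

def expand_query(terms):
    """Return the original search terms plus WordNet synonyms for each."""
    expanded = set()
    for term in terms:
        expanded.add(term)
        for synset in wordnet.synsets(term):
            for lemma in synset.lemma_names():
                expanded.add(lemma.replace("_", " ").lower())
    return sorted(expanded)

if __name__ == "__main__":
    # A search for "car" would also match documents mentioning its synonyms
    print(expand_query(["car"]))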

I hope to publicly host the WordNet server I've developed in this series once I get my home page's CherryPy setup updated for 2.2.

See other articles in the column. Comments are welcome here on Copia or on the column's official discussion forum. Next up in Thinking XML: RDF equivalents for the WordNet/XML.

[Uche Ogbuji]

via Copia

" Process Atom 1.0 with XSLT"

"Process Atom 1.0 with XSLT"

Learn XSLT techniques for processing Atom documents. In this tutorial, author Uche Ogbuji shows how with real-world use cases. (free registration required)

Atom 1.0 is [the] Internet Engineering Task Force (IETF) standard for Web feeds -- information updates on Web site contents. Since Atom is an XML format, XSLT is a powerful tool for processing it. In this tutorial, Uche Ogbuji looks at XSLT techniques for processing Atom documents, addressing real-life use cases.

This tutorial shows you how to:

  • Navigate the basic structure of Atom 1.0 documents using XPath expressions
  • Use these expressions to drive XSLT transformations of Atom source files
  • Deal with the complications of text and markup embedded in Atom files

You will also learn how to use XSLT templates to generate valid Atom files, and how to check the validity of the results.

A companion piece to my recent XML.com article "Handling Atom Text and Content Constructs", this is a task-driven tutorial, taking a more deliberate pace and focusing on XSLT.
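
As a taste of the XPath side, here's a small sketch (mine, not taken from the tutorial) of the classic Atom gotcha: every element lives in the http://www.w3.org/2005/Atom namespace, so XPath expressions need a prefix bound to that namespace even though Atom documents normally use a default namespace. I use lxml rather than XSLT here just to keep the example compact.

from lxml import etree

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

FEED = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <updated>2005-12-01T00:00:00Z</updated>
  <entry>
    <title type="html">Process &lt;em&gt;Atom&lt;/em&gt; 1.0</title>
    <link rel="alternate" href="http://example.org/2005/12/post"/>
    <updated>2005-12-01T00:00:00Z</updated>
  </entry>
</feed>"""

doc = etree.XML(FEED)
# Note the atom: prefix in every XPath step, despite the default namespace above
print(doc.xpath("string(atom:title)", namespaces=ATOM_NS))
for entry in doc.xpath("atom:entry", namespaces=ATOM_NS):
    title = entry.xpath("string(atom:title)", namespaces=ATOM_NS)
    href = entry.xpath("string(atom:link[@rel='alternate']/@href)",
                       namespaces=ATOM_NS)
    print(title, "->", href)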

developerWorks has had a lot to say about Atom lately, courtesy of James Snell (who is also writing a lot of useful Atom extension drafts).

I guess the question is: how do you celebrate Atom's promotion to RFC 4287? Why, by cooking up even more reading material.

[Uche Ogbuji]

via Copia

Agile Web #2: "Handling Atom Text and Content Constructs"

"Handling Atom Text and Content Constructs"

Uche Ogbuji's Agile Web column returns with a look at handling some of the trickier issues in the Atom Syndication Format, which has recently become RFC 4287, an IETF standard.

Second article in my new column is out. In this one I focus on Atom text and content constructs. I spent more time on the Atom examples and less on the sample processing code, but I thought more of the former would be especially useful. I've been working with and writing about Atom a lot lately, and in fact I have an IBM developerWorks tutorial for Atom processing in XSLT in production. It should be live some time today.
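
For anyone who hasn't hit the trickiness yet, here's a rough sketch (not code from the article) of the dispatch an Atom consumer needs for text constructs: type="text" is plain text, type="html" is escaped HTML carried as character data, and type="xhtml" wraps real markup in a single XHTML div.

from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

def text_construct_value(elem):
    """Return the logical content of an Atom text construct element."""
    kind = elem.get("type", "text")
    if kind == "xhtml":
        # Real markup lives inside a single wrapper div in the XHTML namespace;
        # return its serialized contents
        div = elem.find("{%s}div" % XHTML_NS)
        return (div.text or "") + "".join(
            etree.tostring(child, encoding="unicode") for child in div)
    # "text" and "html" both carry their payload as character data;
    # for "html" that payload is escaped HTML the consumer may re-parse
    return elem.text or ""

if __name__ == "__main__":
    title = etree.XML(
        b'<title xmlns="http://www.w3.org/2005/Atom" type="html">'
        b'AT&amp;amp;T invests in &lt;b&gt;XML&lt;/b&gt;</title>')
    print(text_construct_value(title))  # escaped HTML: AT&amp;T invests in <b>XML</b>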

Joe Gregorio has been working on the other half of the Atom pie (old joke for folks who've been following Atom), and he has a very timely new article out: "Catching Up with the Atom Publishing Protocol".

And once again, if you'd like to discuss Atom (syntax or publishing protocol), please do join us on the #atom channel on irc.freenode.net.

[Uche Ogbuji]

via Copia

XSLT for converting from OPML to XBEL and XOXO

In all this Web feed hacking I've been working with my feed list, originally exported from Lektora in OPML format. I wrote XSLT to convert from OPML to XBEL and XOXO. In the case of XOXO I really couldn't figure out any common conventions for Web feeds, so I made up my own for now. The resulting XBEL looks a lot easier to work with, so I'm proposing extensions for feed URL/site URL coupling in the renewal of XBEL. I figured my XSLT might be useful to others, so here are the links:

Going from XBEL to OPML, I've been using Dan MacTough's XSLT. (He also has an XBEL to XHTML transform). I sometimes have to tweak the resulting attributes to deal with xmlUrl/url and title/text type OPML madness.
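
For anyone who would rather see the shape of the conversion than read XSLT, here's a quick Python sketch (not the linked transforms) of the OPML-to-XBEL mapping, including the xmlUrl/url and title/text variations just mentioned.

from lxml import etree

def opml_to_xbel(opml_bytes):
    """Convert OPML outline elements to a flat XBEL bookmark list."""
    opml = etree.XML(opml_bytes)
    xbel = etree.Element("xbel", version="1.0")
    for outline in opml.findall(".//outline"):
        # OPML in the wild uses either xmlUrl or url for the feed address
        href = outline.get("xmlUrl") or outline.get("url")
        if not href:
            continue  # skip folder-only outline elements
        bookmark = etree.SubElement(xbel, "bookmark", href=href)
        title = etree.SubElement(bookmark, "title")
        # ...and either title or text for the human-readable label
        title.text = outline.get("title") or outline.get("text") or href
    return etree.tostring(xbel, xml_declaration=True,
                          encoding="utf-8", pretty_print=True)

if __name__ == "__main__":
    sample = b"""<opml version="1.1"><body>
      <outline text="Copia" xmlUrl="http://copia.ogbuji.net/blog/atom.xml"/>
    </body></opml>"""
    print(opml_to_xbel(sample).decode("utf-8"))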

I've also posted my Web feed list in XBEL form. It uses old school XBEL 1.0, and not any of the metadata additions I'm hoping to see in 1.2. As such, it's only a list of Web feeds and doesn't include the corresponding Weblog home pages.

[Uche Ogbuji]

via Copia

XSLT l10n in 2 tags? 20 says the gentleman over there? Ah, the lady in green has 200.

I was pretty well ROTFL after reading this Daily WTF (via XSLT Blog). OK, so that code isn't even really doing l10n. I'm not sure what the coder thinks it's doing. It's a complete exercise in useless cut and paste. But it's worth noting that you can do the task of competent l10n in a tenth of the tag load used in the WTF example (see DocBook XSL), and you can even do it with a tenth of the tag load used in DocBook, if you don't mind using an XSLT extension module.

[Uche Ogbuji]

via Copia

Don't give me that monkey-ass Web 1.0, either

Musing about whether XML and RDF are too hard (viz. Mike Champion's summary of Bosworth), and whether XQuery and OWL are really the right food for better XML tools (viz. Mike Champion's summary of Florescu), my first reaction was to the latter idea, especially with respect to XQuery. I argued that declarative programming is the key, but that it is quite possible to take advantage of declarative programming outside of XQuery. Nothing new there: I've been arguing for the marriage of XML and declarative techniques within "agile" languages for years. I don't think that declarative techniques inevitably require bondage-and-discipline type systems (thanks to Amyzing (1), (2) for that killer epithet).

Since then, I've also been pondering the XML-too-hard angle. I think folks such as Adam Bosworth are discounting the fact that as organizations increasingly build business plans around aggregation and integration of Web material, there comes an inevitable backlash against the slovenliness of HTML legacy and RSS Babel. Sloppy might be good enough for Google, but who makes money off that? Yeah. Just Google. Others such as Yahoo and Microsoft have started to see the importance of manageable text formats and at least modest depth of metadata. The IE7 team's "well-formed-Web-feeds-only" pledge is just one recent indicator that there will be a shake-up. No one will outlaw tag soup overnight, but as publishers find that they have to produce clean data, and some minimally clean metadata, to participate in large parts of the not-Google-after-Web marketplace, they will fall in line. Of course this does not mean that there won't be people gaming the system, and all this Fancy Web agitation is probably just a big, speculative bubble that will burst soon and take with it all these centralizing forces. But at least in the medium term, I think that pressure on publishers will lead to a healthy market for good non-sloppy tools, which is the key to non-sloppy data.

Past success is no predictor of future performance, and that goes for the Web as well. I believe that folks whose scorn of "Web 2.0" takes them all the way back to what they call "Web 1.0" are buying airline stock in August of 2001.

[Uche Ogbuji]

via Copia

Agile Web #1: "Google Sitemaps"

"Google Sitemaps"

Uche Ogbuji's new XML.com column, "Agile Web," explores the intersection of agile programming languages and Web 2.0. In this first installment he examines Google's Sitemaps schema, as well as Python and XSLT code to generate site maps. [Oct. 26, 2005]

And with this article the "Python and XML" column has been replaced by a new one titled "Agile Web".

I wrote the Python-XML column for three years, discussing the combination of an agile programming language with an agile data format. It's time to pull the lens back a bit to take in other such technologies. This new column, "Agile Web," will cover the intersection of dynamic programming languages and web technologies, particularly the sorts of dynamic developments on the web for which some use the moniker, "Web 2.0." The primary language focus will still be Python, with some ECMAScript. Occasionally there will be some coverage of other dynamic languages as well.

In this first article I introduce the Google Sitemaps program, its XML format, and Python tools for working with it.
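
As a teaser, here's a bare-bones sketch (mine, not the article's code) of generating a Sitemap document in Python. The 0.84 namespace below is the one Google documented at the time; adjust it if you're targeting a later revision of the format.

from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"

def make_sitemap(urls):
    """Build a Sitemap urlset from (loc, lastmod) pairs."""
    ET.register_namespace("", SITEMAP_NS)  # serialize with a default namespace
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    for loc, lastmod in urls:
        url = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
        ET.SubElement(url, "{%s}loc" % SITEMAP_NS).text = loc
        ET.SubElement(url, "{%s}lastmod" % SITEMAP_NS).text = lastmod
    return ET.tostring(urlset, encoding="unicode")

if __name__ == "__main__":
    print(make_sitemap([("http://copia.ogbuji.net/blog/", "2005-10-26")]))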

[Uche Ogbuji]

via Copia