Mystery of Google index drop solved?

Update. Corrected Christian's surname. Sorry, man.

A while ago I complained that uche.ogbuji.net disppeared from Google search results soon after I went to a CherryPy-based server. I'm up to say that I'm a goof, but I hope that admitting my silly error might save someone else some head-scratching (maybe even this gentleman)

I'm at least not alone in my error. The clue came from this message by the very smart Christian Wyglendowski In my case I was getting 404s for most things, but I did have a bug that was causing a 500 error on requests to robots.txt. Apparently the Google bot shuns sites with that problem. I can understand that but it's interesting that Yahoo doesn't seem to do the same thing, since my ranking didn't drop much there. I fixed the bug and then submitted a reinclusion request to Google following the suggestions in this article (I guess SEO advice isn't a completely parasitic endeavor). The body of my message was as follows:

I had a bug causing 500 error on robots.txt request, and I think that's why I got dropped from your index. I've fixed that bug, and would like to request reinclusion to your index. Thanks.

We'll see if that does the trick.

[Uche Ogbuji]

via Copia

CherryPy QOTW

OK there are many fun zingers from the latest CherryPy/WSGI/Paste tiff, but this one resonated soundly with me:

In times BC (Before CherryPy ;) I would simply write stuff in PHP because it was easier than fitting my square mind in some Python framework's tetrahegon shaped hole.

—Christian Dowski, Captain Buffet

Such an L7. :-) And I don't know why that puts in my head the nonce construction "techno-hadron".

[Uche Ogbuji]

via Copia

Planet Atom

Planet Atom is now live.

Planet Atom focuses Atom streams from authors with an affinity for syndication and Atom-specific issues. This site was developed by Sylvain Hellegouarch, Uche Ogbuji, and John L. Clark. Please visit the Planet Atom development site if you are interested in the source code. The complete list of sources is maintained in XBEL format (with some experimental extensions); please contact one of the site developers if you want to suggest a modification to this list.

John, Sylvain and I have been working at this on and off for over a month now (we've all been swamped with other things—the actual development of the site was fairly straightforward). Planet Atom is built on an aggregation from Atom 1.0 feeds into one larger feed (with entries collated, trimmed etc.) It's built on 4Suite (for XSLT processing), CherryPy (for Web serving), Amara (for Atom feed slicing and dicing), atomixlib (for building the aggregate feed) and dateutil (for date wrangling), with Python and XML as the twin foundations, of course. Thanks to folks on the #atom and #swhack IRC channels for review and feedback.

[Uche Ogbuji]

via Copia

Thinking XML #34: Search engine enhancement using the XML WordNet server system

Updated—Fixed link to "Serving up WordNet as XML"

"Thinking XML: Search engine enhancement using the XML WordNet server system"

Subtitle: Also, use XSLT to create an RDF/XML representation of the WordNet data
Synopsis: In previous installments of this column, Uche Ogbuji introduced the WordNet natural language database, and showed how to represent database nodes as XML and serve this XML though the Web. In this article, he shows how to convert this XML to an RDF representation, and how to use the WordNet XML server to enrich search engine technology.

This is the final part of a mini-series within the column. The previous articles are:

In this article I write my own flavor of RDF schema for WordNet, a transform for conversion from the XML format presented previously, and a little demo app that shows how you can use WordNet to enhance search with synonym capabilities (and this time it's a much faster approach).

I hope to publicly host the WordNet server I've developed in this series once I get my home page's CherryPy setup updated for 2.2.

See other articles in the column. Comments here on Copia or on the column's official discussion forum. Next up in Thinking XML, RDF equivalents for the WordNet/XML.

[Uche Ogbuji]

via Copia

Today's XML WTF: Internal entites in browsers

This unnecessary screw-up comes from the Mozilla project, of all places. Mozilla's XML support is improving all the time, as I discuss in my article on XML in Firefox, but the developer resources seem to lag the implementation, and this often leads to needless confusion. One that I ran into recently could perhaps be given the summary: "not everything in the Mozilla FAQ is accurate". From the Mozilla FAQ:

In older versions of Mozilla as well as in old Mozilla-based products, there is no pseudo-DTD catalog and the use of entities (other than the five pre-defined ones) leads to an XML parsing error. There are also other XHTML user agents that do not support entities (other than the five pre-defined ones). Since non-validating XML processors are not required to support entities (other than the five pre-defined ones), the use of entities (other than the five pre-defined ones) is inherently unsafe in XML documents intended for the Web. The best practice is to use straight UTF-8 instead of entities. (Numeric character references are safe, too.)

See the part in bold. Someone either didn't read the spec, or is intentionally throwing up a spec distortion field. The XML 1.0 spec provides a table in section 4.4: "XML Processor Treatment of Entities and References" which tells you how parsers are allowed to treat entities, and it flatly contradicts the bogus Mozilla FAQ statement above.

The main reason for the "WTF" is the fact that the Mozilla implementation actually gets it right. That it should. It's based on Expat. AFAIK Expat has always got this right (I've been using Expat about as long as the Mozilla project has been), so I'm not sure what inspired the above error. Mozilla should be touting its correct and useful behavior, rather than giving bogus excuses to its competitors.

This came up last week in the IBM developerWorks forum where a user was having problems with internal entities in XHTML. It turns out that he was missing an XHTML namespace (and based on my experimentation was probably serving up XHTML as text/html which is generally a no-no). It should have been a clear case of "Mozilla gets this right, and can we please get other browsers to fix their bugs?" but he found that FAQ entry and we both ended up victims of the red herring for a little while.

I didn't realize that the Mozilla implementation was right until I wrote a careful test case in preparation for my next Firefox/XML article. The following CherryPy code is a test server set-up for browser rendering of XHTML.

import cherrypy

INTENTITYXHTML = '''\
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
         "http://www.w3.org/TR/xhtml/DTD/xhtml1-strict.dtd" [
<!ENTITY internal "This is text placed as internal entity">
]>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">
  <head>
    <title>Using Entity in xhtml</title>
  </head>
  <body>
    <p>This is text placed inline</p>
    <p>&internal;</p>
    <abbr title="&internal;">Titpaie</abbr>
  </body>
</html>
'''

class root:
    @cherrypy.expose
    def text_html(self):
        cherrypy.response.headerMap['Content-Type'] = "text/html; charset=utf-8"
        return INTENTITYXHTML

    @cherrypy.expose
    def text_xml(self):
        cherrypy.response.headerMap['Content-Type'] = "text/xml; charset=utf-8"
        return INTENTITYXHTML

    @cherrypy.expose
    def app_xml(self):
        cherrypy.response.headerMap['Content-Type'] = "application/xml; charset=utf-8"
        return INTENTITYXHTML

    @cherrypy.expose
    def app_xhtml(self):
        cherrypy.response.headerMap['Content-Type'] = "application/xhtml+xml; charset=utf-8"
        return INTENTITYXHTML

cherrypy.root = root()
cherrypy.config.update({'server.socketPort': 9999})
cherrypy.config.update({'logDebugInfoFilter.on': False})
cherrypy.server.start()

As an example, this code serves up a content type text/html when accessed through a URL such as http://localhost:9999/text_html. You should be able to work out the other URL to content type mappings from the code, even if you're not familiar with CherryPy or Python.

Firefox 1.0.7 handles all this very nicely. For text_xml, app_xml and app_xhtml you get just the XHTML rendering you'd expect, including the correct text in the attribute value with the mouse hovered over "Titpaie".

IE6 (Windows) and Safari 1.3.1 (OS X Panther) both have a lot of trouble with this.

IE6 in the text_xml and app_xml cases complains that it can't find http://www.w3.org/TR/xhtml/DTD/xhtml1-strict.dtd. In the app_xhtml case it treats the page as a download, which is reasonable, if not convenient.

Safari in the text_xml, app_xml and app_xhtml cases complains that the entity internal is undefined (??!!).

IE6, Safari and Mozilla in the text_html case all show the same output (looking, as it should, like busted HTML). That's just what you'd expect for a tag soup mode, and emphasizes hat you should leave text_html out of your XHTML vocabulary.

All this confusion and implementation difference illustrates the difficulty for folks trying to deploy XHTML, and why it's probably not yet realistic to deploy XHTML without some sort of browser sniffing (perhaps by checking the Accept header, though it's well known that browsers are sometimes dishonest with this header). I understand that the MSIE7 team hopes to address such problems. I don't know whether to expect the same from Safari. My focus in research and experimentation has been on Firefox.

One final note is that Mozilla does not support external parsed entities. This is legal (and some security experts claim even prudent). The relevant part of the XML 1.0 spec is section 4.4.3:

When an XML processor recognizes a reference to a parsed entity, in order to validate the document, the processor MUST include its replacement text. If the entity is external, and the processor is not attempting to validate the XML document, the processor MAY, but need not, include the entity's replacement text. If a non-validating processor does not include the replacement text, it MUST inform the application that it recognized, but did not read, the entity.

I would love Mozilla to adopt the idea in the next spec paragraph:

Browsers, for example, when encountering an external parsed entity reference, might choose to provide a visual indication of the entity's presence and retrieve it for display only on demand.

That would be very useful. I wonder whether it would be possible through a Firefox plug-in (probably not: I guess it would require very tight Expat integration for plug-ins).

[Uche Ogbuji]

via Copia

PyBlosxom...CherryPy...hmmm

I have so much hacking to do on Copia's engine that it puts me off doing anything at all. The atom rendering bug I stumbled upon yesterday especially needs a look-in, but I suspect it would take a move from flavor to plug-in to fix it. I'm just honestly not all that bullish about hacking PyBlosxom right now. Not while I've been having so much fun with CherryPy lately.

A while ago Bill Mill said in a comment here:

I'm about halfway done with a "pyblosxom in cherrypy" thing I've been working on. It works, reads all my pyblosxom blogs, and can leave comments. I just need to refactor it to be more sensible and plugin-oriented.

That's the sort of sign I need in order to to hang on, but I do hope Bill works his way through the remaining half soon enough.

Speaking of CherryPy, recently spotted Cookbook entry: "A simple integration of a CherryPy web server, using Quixote template publishing, managed in its own thread."

[Uche Ogbuji]

via Copia

Thinking XML #33: Serving up WordNet as XML

"Thinking XML: Serving up WordNet as XML"

Subtitle: Build the basic WordNet/XML facilities into a Web server framework
Synopsis: A few articles back, Uche Ogbuji discussed WordNet 2.0, a Princeton University project that aims to build a database of English words and lexical relationships between them. He showed how to extract XML serializations from the word database. In this article he continues the exploration, demonstrating code to serve up these WordNet/XML documents over Web protocols and showing you how to access these from XSLT.

This is the second part of a mini-series within the column. The previous article is "Querying WordNet as XML,", in which I present Python code for processing WordNet 2.0 into XML. This time I use CherryPy to expose the XML on the Web, either in human-readable or in raw form. This seems to be part of a nice trend of CherryPy on developerWorks. I hope people see this as yet another example of how easy and clean CherryPy is.

See other articles in the column. Comments here on Copia or on the column's official discussion forum. Next up in Thinking XML, RDF equivalents for the WordNet/XML.

[Uche Ogbuji]

via Copia

Python/XML community: Amara, lxml and Picket

Amara XML Toolkit 1.0b3
lxml 0.7
Picket 0.4

Amara XML Toolkit 1.0b3 "is a collection of Python tools for XML processing—not just tools that happen to be written in Python, but tools built from the ground up to use Python idioms and take advantage of the many advantages of Python. Amara builds on 4Suite [http://4Suite.org], but whereas 4Suite focuses more on literal implementation of XML standards in Python, Amara focuses on Pythonic idiom." In this release:

  • Add xmlsetattribute method to elements, in order to allow adding attributes with namespaces or with illegal Python names
  • Update manual source for markdown, and extensive improvements to the manual (with much help from Jamie Norrish)
  • Add xml_doc facility for nodes
  • Fix support for output parameters in xml()
  • Add support for rules to pushbind
  • Improve XSLT support for bindery objects (see demo/bindery/xslt.py)
  • Bug fixes

lxml 0.7 is an alternative, more Pythonic binding for the libxml2 and libxslt XML processing libraries. Martijn Faassen says "lxml 0.7 is a release with quite a few new features and bug fixes, including XPath expression parameters, XInclude support, XMLSchema validation support, more namespace prefix support, better encoding support, and more."

Sylvain Hellegouarch updated Picket, a simple CherryPy filter for processing XSLT as a template language. It uses 4Suite to do the job. This update is mostly in order to support CherryPy development snapshots that are soon to become CherryPy 2.1. A CherryPy "filter is an object that has a chance to work on a request as it goes through the usual CherryPy processing chain."

[Uche Ogbuji]

via Copia