I was asked today to write a short autobiography to go on the company website. This is what I wrote:

Barry van Oudtshoorn hails from the snow-clad plains of the Serengeti delta. Overcoming his debilitating muteness (Barry was born without a larynx), he has become an accomplished ventriloquist and master orator. He has presided over Presidential coming-of-age ceremonies and the marriages of seventeen Catholic priests, both in Uzbekistan and abroad.

The winner of the inaugural bi-annual toothbrush-depilation contest in 2010 (and a runner-up the year before that), Barry leads a quiet life. Despite being married to Ariel, he has led a comfortable and well-cooked life. Barry enjoys making jellies wobble and cutting his toenails.

I’m not sure that this will be what goes up, though…

(Un)quoted HTML attributes

In HTML, it’s perfectly valid to write something like this:

<p class=alpha>Lorem ipsum dolor sit amet etc.</p>

Indeed, Internet Explorer 7 and 8 favour this approach. Whereas Firefox, for example, would return the above as

<p class="alpha">Lorem ipsum dolor sit amet etc.</p>

when retrieving it using innerHTML, IE 7 and 8 only quote attributes that satisfy certain criteria:

  • Any attributes that contain spaces;
  • Some magic set of standard attributes, such as href;
  • Any custom attributes you may have specified.

Unfortunately, markup with unquoted attributes isn’t well-formed XML. In the software I develop at work, we produce PDFs that can include user-generated HTML. To do this, we use XSL-FO, an XML-based formatting language. You can see where this is going: the backend requires well-formed XML, but the frontend is sending through HTML. The quickest way to fix this is with a simple regex, like so:

var s = '<p class=alpha>Lorem ipsum dolor sit amet etc.</p>';
s = s.replace(/=([^"'`>\s]+)/g, '="$1"');
// s === '<p class="alpha">Lorem ipsum dolor sit amet etc.</p>'
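
To see what the pattern does and doesn’t touch, here is a small sketch that wraps it in a helper. The name quoteUnquoted is mine, for illustration only, and I’m assuming the character class is meant to exclude quotes, backticks, > and whitespace. A value that is already quoted is skipped because the character straight after its = is a quote.

```javascript
// Hypothetical helper wrapping the regex above; the name is illustrative.
function quoteUnquoted(html) {
  // After '=', capture a run of characters that are not quotes,
  // backticks, '>' or whitespace, and wrap that run in double quotes.
  return html.replace(/=([^"'`>\s]+)/g, '="$1"');
}

// An already-quoted value is left alone; the unquoted one is wrapped.
var link = quoteUnquoted('<a href="x.html" target=_blank>link</a>');
console.log(link); // <a href="x.html" target="_blank">link</a>

// Attributes without a value contain no '=', so nothing changes.
var input = quoteUnquoted('<input disabled>');
console.log(input); // <input disabled>
```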

Bear in mind that this regex is by no means perfect. It will, for example, convert this:

<p class=alpha>Lorem ipsum</p>
<p>dolor sit=amet</p>

into this:

<p class="alpha">Lorem ipsum</p>
<p>dolor sit="amet</p">

…which is obviously not what we want. I spent a while trying to come up with a regex that would solve this problem, but I stopped pretty soon. What’s really needed here is a parser: something that can take in the tag soup that Internet Explorer produces, and produce valid XHTML (which is valid XML). A quick search reveals myriad implementations in various languages — Python, Java, even JavaScript.
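
Short of a real parser, one stopgap is to apply the replacement only inside things that look like opening tags, and to step over values that are already quoted. The sketch below is my own illustration (the function name quoteTagAttributes is hypothetical) and it is still regex-based, so it only narrows the failure cases. It copes with the example above, but it is no substitute for a tag-soup parser.

```javascript
// Hypothetical stopgap, not a parser: only rewrites text inside opening tags.
function quoteTagAttributes(html) {
  // An opening tag: '<', a letter, then quoted strings or any characters
  // other than quotes and '>', up to the closing '>'.
  var openingTag = /<[a-zA-Z](?:"[^"]*"|'[^']*'|[^'">])*>/g;
  // '=' followed by a double-quoted, single-quoted or unquoted value.
  var attributeValue = /=\s*("[^"]*"|'[^']*'|[^\s"'`>]+)/g;

  return html.replace(openingTag, function (tag) {
    return tag.replace(attributeValue, function (match, value) {
      var first = value.charAt(0);
      // Already quoted: keep the original text untouched.
      if (first === '"' || first === "'") {
        return match;
      }
      return '="' + value + '"';
    });
  });
}

var fixed = quoteTagAttributes('<p class=alpha>Lorem ipsum</p>\n<p>dolor sit=amet</p>');
console.log(fixed);
// <p class="alpha">Lorem ipsum</p>
// <p>dolor sit=amet</p>
```

Because quoted values are consumed whole, an = inside one (a query string in an href, say) is left alone too. Text that merely resembles a tag can still fool it, which is exactly why a parser is the better tool.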

So, after all that, what’s the take-away from this post? Just this: web browsers (slightly older versions of Internet Explorer in particular) are imperfect. XML, like HTML, descends from SGML, but it is much less forgiving; whether or not its strictness is a good thing is up for debate. I guess that it’s a bit like reading Shakespeare nowadays: you can pretty much understand it, but every now and then you have to reach for a dictionary to make sense of what’s going on. Of course, when you don’t understand something in Shakespeare, you don’t fall over in a heap, but let’s not stretch the analogy too far.

In brief

When retrieving the innerHTML of an element (or using contenteditable), Internet Explorer doesn’t always wrap attribute values in quotes. The solution to this is not a magnificently obtuse regex, but a tag-soup parser that can return valid XML.