(Un)quoted HTML attributes

In HTML, it’s perfectly valid to write something like this:

<p class=alpha>Lorem ipsum dolor sit amet etc.</p>

Indeed, Internet Explorer 7 and 8 favour this approach. Whereas Firefox, for example, would return the above as

<p class="alpha">Lorem ipsum dolor sit amet etc.</p>

when retrieving it using innerHTML, IE 7 and 8 only quote attributes that satisfy certain criteria:

  • Any attributes that contain spaces;
  • Some magic set of standard attributes, such as href;
  • Any custom attributes you may have specified.

Unfortunately, unquoted attribute values don’t constitute well-formed XML. In the software I develop at work, we produce PDFs that can include user-generated HTML. To do this, we use XSL-FO, an XML-based system. You can see where this is going: the backend requires well-formed XML, but the frontend is sending through HTML. The quickest way to fix this is with a simple regex, like so:

var s = '<p class=alpha>Lorem ipsum dolor sit amet etc.</p>';
s = s.replace(/=([^"'`>\s]+)/g, '="$1"');
// s === '<p class="alpha">Lorem ipsum dolor sit amet etc.</p>'

Bear in mind that this regex is by no means perfect. It will, for example, convert this:

<p class=alpha>Lorem ipsum</p>
<p>dolor sit=amet</p>

into this:

<p class="alpha">Lorem ipsum</p>
<p>dolor sit="amet</p">

…which is obviously not what we want. I briefly tried to come up with a regex that would solve this problem, but soon gave up. What’s really needed here is a parser: something that can take in the tag soup that Internet Explorer produces, and produce valid XHTML (which is valid XML). A quick search reveals myriad implementations in various languages: Python, Java, even JavaScript.
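As a stopgap, the failure above can be narrowed by only touching text that sits inside a start tag. The sketch below is an illustration rather than code from the project described here: quoteAttributes is a hypothetical helper name, and it still leans on regexes, so it is no substitute for a real tag-soup parser.

```javascript
// Hypothetical helper (not from the original post): quote unquoted
// attribute values, but only inside start tags.
function quoteAttributes(html) {
  // A start tag: "<", a tag name, then everything up to the closing ">".
  // Quoted values are consumed whole, so a ">" inside quotes is skipped.
  var startTag = /<[a-zA-Z][^\s\/>]*(?:"[^"]*"|'[^']*'|[^'">])*>/g;

  // Inside a tag, match either an already-quoted value (left untouched)
  // or whitespace + attribute name + "=" + an unquoted value.
  var attr = /("[^"]*"|'[^']*')|(\s[^\s"'<>\/=]+\s*=\s*)([^\s"'`=<>]+)/g;

  return html.replace(startTag, function (tag) {
    return tag.replace(attr, function (match, quoted, prefix, value) {
      return quoted ? match : prefix + '"' + value + '"';
    });
  });
}

var input = '<p class=alpha>Lorem ipsum</p>\n<p>dolor sit=amet</p>';
console.log(quoteAttributes(input));
// <p class="alpha">Lorem ipsum</p>
// <p>dolor sit=amet</p>
```

This leaves sit=amet alone because it never appears inside a tag. It does nothing about unescaped ampersands, valueless attributes such as disabled, or unclosed elements like <br>, all of which still break XML and are exactly why a parser is the better answer.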

So, after all that, what’s the take-away from this post? Just this: web browsers (slightly older versions of Internet Explorer in particular) are imperfect. XML was born out of SGML, just as HTML was, and is much less forgiving; whether or not its strictness is a good thing is up for debate. I guess that it’s a bit like reading Shakespeare nowadays: you can pretty much understand it, but every now and then you have to reach for a dictionary to make sense of what’s going on. Of course, when you don’t understand something in Shakespeare, you don’t fall over in a heap, but let’s not stretch the analogy too far.

In brief

When retrieving the innerHTML of an element (or using contenteditable), Internet Explorer doesn’t always wrap attribute values in quotes. The solution to this is not a magnificently obtuse regex, but a tag-soup parser that can return valid XML.

1 thought on “(Un)quoted HTML attributes”

  1. emateu says:

    Thanks, this simple regex helped me a lot… =). F*cking IE =)
