Clean Word markup

2010-01-20 – 12:21pm

When writing a web app that accepts formatted input from users, you’ll often find that they copy and paste text from Microsoft Word. Unfortunately, Word fills the markup with lots of unnecessary and unwanted muck. To clean this all up, I wrote the following function (implemented directly on the String prototype below):

String.implement({
	sanitiseWord: function() {
		var s = this.replace(/[\r\n]/g, ' '); // Flatten line breaks so that "." can match across them
		var rs = [];
		rs.push(/<!--.+?-->/g); // Comments
		rs.push(/<title>.+?<\/title>/g); // Title
		rs.push(/<(meta|link|.?o:|.?style|.?div|.?head|.?html|.?body|.?span|!\[)[^>]*?>/g); // Unnecessary tags
		rs.push(/ v:.*?=".*?"/g); // Weird nonsense attributes
		rs.push(/ style=".*?"/g); // Styles
		rs.push(/ class=".*?"/g); // Classes
		rs.push(/(&nbsp;){2,}/g); // Redundant &nbsp;s
		rs.push(/<p>(\s|&nbsp;)*?<\/p>/g); // Empty paragraphs
		rs.each(function(regex) {
			s = s.replace(regex, '');
		});
		return s.replace(/\s+/g, ' ');
	}
});

If you’re not using MooTools, the function will look something like this:

String.prototype.sanitiseWord = function() {
// function body here...
};

Usage

var s = "(some awful Word markup)".sanitiseWord();

In one of the tests I ran, the input went from around 7000 characters to just 700.
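
For anyone without MooTools to hand, here is a minimal sketch of the same clean-up as a plain function. The patterns are copied from the method above; the standalone `sanitiseWord(input)` signature and the sample markup are assumptions made for illustration, and real Word output is a good deal messier.

```javascript
// Standalone sketch of the clean-up above. Taking the string as an argument
// (rather than extending String.prototype) is an illustrative choice.
function sanitiseWord(input) {
	// Flatten line breaks so that "." in the patterns can match across them.
	var s = input.replace(/\r/g, '\n').replace(/\n/g, ' ');
	var patterns = [
		/<!--.+?-->/g, // Comments
		/<title>.+?<\/title>/g, // Title
		/<(meta|link|.?o:|.?style|.?div|.?head|.?html|body|.?body|.?span|!\[)[^>]*?>/g, // Unnecessary tags
		/ v:.*?=".*?"/g, // v: attributes
		/ style=".*?"/g, // Styles
		/ class=".*?"/g, // Classes
		/(&nbsp;){2,}/g, // Redundant &nbsp;s
		/<p>(\s|&nbsp;)*?<\/p>/g // Empty paragraphs
	];
	patterns.forEach(function (regex) {
		s = s.replace(regex, '');
	});
	// Collapse the runs of whitespace left behind by the removals.
	return s.replace(/\s+/g, ' ');
}

// Made-up sample in the style of pasted Word markup.
var dirty = '<p class="MsoNormal" style="margin:0cm">Hello\r\n' +
	'<span style="color:red">world</span><o:p></o:p></p><p>&nbsp;</p>';

console.log(sanitiseWord(dirty)); // <p>Hello world</p>
```

Note that the patterns run in order: the <span> tags are stripped before the style and class attributes, so only the attributes on the surviving <p> are left to remove.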

Some of the regular expressions I used were adapted from C# ones in a post by Jeff Atwood.

The problem

When styling text <input> elements, it’s fairly common to run into a serious problem: they don’t behave like block-level elements.

Note: In all of the examples, the container element is filled with blue, and the <input> itself is filled with red and has an opacity of 50% so that you can see it under- or over-flowing the container.

<div style="background: blue; width:200px;">
  <input style="display:block; padding:4px; background: red; opacity:0.5; border:0;" type="text" value="text input"/>
</div>

You can see how the input doesn’t automatically flow to full width, as the “display: block” style suggests it should. The knee-jerk response is to set the width to 100%:

<div style="background: blue; width:200px;">
  <input style="display:block; padding:4px; background: red; opacity:0.5; width:100%; border:0;" type="text" value="text input"/>
</div>

But notice now how the input overflows its container’s boundaries: the 100% width applies to the content alone, and the horizontal padding is added on top of it. At this point, people may resort to non-semantic markup (removing the padding on the <input> and putting it inside a padded <div>) or JavaScript solutions that set the pixel width whenever the container’s width changes (by the addition of scrollbars, for example).

The (semantic) solution

But wait! There is a way to achieve this effect without resorting to an extra <div> or JavaScript:

<div style="background: blue; width:200px;">
  <input style="display:block; padding:4px 0; background: red; opacity:0.5; width:100%; border:0; text-indent:4px;" type="text" value="text input"/>
</div>

Do you see what I did there? I removed the horizontal padding on the <input>, so the 100% width now works correctly, and replaced it with “text-indent”. To the user, this looks no different, and it has the advantage of requiring no extraneous markup or tedious scripting.

Drawbacks

  1. Should the user enter a long string, their text will bump up against the right edge. But I think that that’s a boundary condition that I can live with.
  2. Any left or right border on the <input> will cause it to overflow its container. Personally, though, if I want a full-width <input>, I generally don’t want any borders on its left or right other than those of its container.
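
The arithmetic behind the overflow, the fix, and the second drawback can be sketched in a few lines. This assumes the default content-box sizing, where the declared width covers the content only; the `outerWidth` helper and the 1px border are hypothetical values chosen for illustration.

```javascript
// Rendered width in pixels under content-box sizing: the declared width
// plus padding and border on both the left and the right.
// `outerWidth` is a hypothetical helper, not a browser API.
function outerWidth(contentWidth, sidePadding, sideBorder) {
	return contentWidth + 2 * sidePadding + 2 * sideBorder;
}

var container = 200; // width of the <div> in the examples

// width:100% with padding:4px on every side
var padded = outerWidth(container, 4, 0);
// width:100% with padding:4px 0 and text-indent:4px (the indent adds no width)
var indented = outerWidth(container, 0, 0);
// Drawback 2: the same, plus an assumed 1px border on the left and right
var bordered = outerWidth(container, 0, 1);

// Overflow in pixels for each case
console.log(padded - container, indented - container, bordered - container); // 8 0 2
```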

Often, you will need to prevent users from entering data that doesn’t conform to a specific pattern. For example, you may want to allow users to enter only numbers or only valid email addresses. To this end, I’ve written a little utility function that returns the “standardised” version of a string, according to the regex you supply.

String.implement({
	limitContent: function(allowedRegex) {
		return $splat(this.match(allowedRegex)).join('');
	}
});

Basically, the function takes the result of evaluating the regular expression on the string, converts it into an array if it isn’t one (so that a failed match, which yields null, becomes an empty array), and then joins the array’s elements together with an empty string.

Examples:

console.log("12345".limitContent(/.{4}/)); // Only allow four characters
console.log("joe@mail.com".limitContent(/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}/)); // Only allow email addresses
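
Without MooTools, the same idea can be sketched as a plain function. The `limitContent(value, allowedRegex)` signature is an assumption for illustration. It also differs from the method above in one respect: for a regex without the g flag it keeps only the full match, whereas joining the whole match array would also append any capture groups.

```javascript
// Plain-function sketch of limitContent; the name and signature are
// illustrative assumptions and not part of MooTools.
function limitContent(value, allowedRegex) {
	var matches = value.match(allowedRegex);
	if (matches === null) {
		return ''; // Nothing in the input conforms to the pattern
	}
	// With the g flag, match() returns every match. Without it, index 0 is
	// the full match and the remaining entries are capture groups, so only
	// index 0 is kept.
	return allowedRegex.global ? matches.join('') : matches[0];
}

console.log(limitContent('12345', /.{4}/)); // 1234
console.log(limitContent('a1b2c3', /\d/g)); // 123
console.log(limitContent('letters only', /\d/g) === ''); // true
```

The g flag is what turns the function into a character filter: `/\d/g` keeps every digit wherever it appears, while a pattern without the flag keeps only the first conforming run.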

Internet censorship: Letter #1

2009-12-23 – 7:32am

The letter below is addressed to my local MP and concerns Australia’s proposed internet censorship scheme.

Dear Mr Keenan,

I am writing to you concerning the soon-to-be-trialled internet filtering scheme. As a resident of the City of Stirling and a UWA-qualified computer scientist, I have reservations about the efficacy, utility, impact, and morality of this initiative.

Firstly, the internet encompasses more than just web pages and web sites. The so-called “deep web” is thought to be many times larger than the “surface web” (that which can be examined by Google, for example). Whereas the majority of websites transfer data over HTTP, which is subject to filtering under the proposed system, the deep web makes use of a wide variety of alternative data interchange systems, including torrents, Usenet, and VPNs. None of these will be filtered under the scheme, yet it is here that much of the undesirable content is to be found. [1]

Prior to the change of federal government, a system was in place whereby anyone could obtain free filtering software from the Australian government. This software ran on the computers themselves, and thus placed the onus for preventing unsuitable material from arriving on those in charge of the computers. In other words, parents were responsible for the well-being and safety of their children whilst online — in my opinion, a far more desirable state of affairs. [2]

The federal government has been extolling the virtues of high-speed internet across the country. Whilst I applaud this initiative, I have to question the sense of improving internet speeds nationwide, only to then drastically reduce them through the introduction of mandatory internet filtering. In tests, filtering has been shown to delay access by 10 ms and, where bottlenecks occur, by much longer. Surely this is nonsensical. [3]

My final, and perhaps most significant, issue with the proposed implementation is that the “blacklist” of blocked sites will be inaccessible to the public. Australia is a nation founded on the ideals of a free, democratic, and transparent government. To make this list unavailable suggests that the filtering may be politically or privately motivated, politicians’ assurances notwithstanding. [4]

I ask that you carefully consider the issues I have raised, and that you stand and speak against this system.

Yours faithfully,

Barry van Oudtshoorn

[1] http://www.nytimes.com/2009/02/23/technology/internet/23search.html?_r=1&th&emc=th
[2] http://www.netalert.gov.au/about_netalert.html
[3] http://www.efa.org.au/censorship/mandatory-isp-blocking/#SS_7
[4] http://www.thestandard.com/news/2008/10/13/no-opt-out-filtered-internet