Clean Word markup

When writing a web app that accepts formatted input from users, you’ll often find that they copy and paste text from Microsoft Word. Unfortunately, Word fills the markup with lots of unnecessary and unwanted muck. To clean it all up, I wrote the following function (implemented directly on the String prototype using MooTools):

String.implement({
	sanitiseWord: function() {
		var s = this.replace(/\r/g, '\n').replace(/\n/g, ' ');
		var rs = [];
		rs.push(/<!--.+?-->/g); // Comments
		rs.push(/<title>.+?<\/title>/g); // Title
		rs.push(/<(meta|link|.?o:|.?style|.?div|.?head|.?html|body|.?body|.?span|!\[)[^>]*?>/g); // Unnecessary tags
		rs.push(/ v:.*?=".*?"/g); // Weird nonsense attributes
		rs.push(/ style=".*?"/g); // Styles
		rs.push(/ class=".*?"/g); // Classes
		rs.push(/(&nbsp;){2,}/g); // Redundant &nbsp;s
		rs.push(/<p>(\s|&nbsp;)*?<\/p>/g); // Empty paragraphs
		rs.each(function(regex) {
			s = s.replace(regex, '');
		});
		return s.replace(/\s+/g, ' ');
	}
});

If you’re not using MooTools, the function will look something like this:

String.prototype.sanitiseWord = function() {
// function body here...
};
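
For reference, here is a minimal sketch of what the complete plain-JavaScript version might look like. It assumes an ES5 environment, so MooTools' `each` is swapped for the standard `Array.prototype.forEach`; the `patterns` and `dirty` names and the sample markup are only illustrative, and the regular expressions are the same ones used in the MooTools version.

```javascript
// Plain-JavaScript sketch of the MooTools method (assumes ES5 forEach is available).
String.prototype.sanitiseWord = function () {
	// Flatten line breaks so the patterns can match across what were separate lines.
	var s = this.replace(/\r/g, '\n').replace(/\n/g, ' ');
	var patterns = [
		/<!--.+?-->/g, // Comments
		/<title>.+?<\/title>/g, // Title
		/<(meta|link|.?o:|.?style|.?div|.?head|.?html|body|.?body|.?span|!\[)[^>]*?>/g, // Unnecessary tags
		/ v:.*?=".*?"/g, // Weird nonsense attributes
		/ style=".*?"/g, // Styles
		/ class=".*?"/g, // Classes
		/(&nbsp;){2,}/g, // Redundant &nbsp;s
		/<p>(\s|&nbsp;)*?<\/p>/g // Empty paragraphs
	];
	// Strip every match of each pattern in turn.
	patterns.forEach(function (regex) {
		s = s.replace(regex, '');
	});
	// Collapse any remaining runs of whitespace into single spaces.
	return s.replace(/\s+/g, ' ');
};

// Illustrative input only; real Word markup is far messier.
var dirty = '<p class="MsoNormal" style="margin:0">Hello <span lang="EN-GB">world</span></p>\r\n<p>&nbsp;</p>';
console.log(dirty.sanitiseWord().trim()); // <p>Hello world</p>
```

Note that the function can leave a leading or trailing space behind, so calling `trim()` on the result may be worthwhile.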

Usage

var s = "(some awful Word markup)".sanitiseWord();

In one of the tests I ran, the input shrank from around 7000 characters to just 700, a reduction of about 90%.

Some of the regular expressions I used were adapted from C# ones in a post by Jeff Atwood.
