When writing a web app that accepts formatted input from users, you’ll often find that they copy and paste text from Microsoft Word. Unfortunately, Word fills the markup with lots of unnecessary and unwanted muck. To clean it all up, I wrote the following function (implemented directly on the String prototype using MooTools):
String.implement({
    sanitiseWord: function() {
        // Flatten line breaks so that '.' in the patterns below can span what were separate lines
        var s = this.replace(/\r/g, '\n').replace(/\n/g, ' ');
        var rs = [];
        rs.push(/<!--.+?-->/g); // Comments
        rs.push(/<title>.+?<\/title>/g); // Title
        rs.push(/<(meta|link|\/?o:|\/?style|\/?div|\/?head|\/?html|\/?body|\/?span|!\[)[^>]*?>/g); // Unnecessary tags (\/? also catches the closing form)
        rs.push(/ v:.*?=".*?"/g); // Weird nonsense attributes
        rs.push(/ style=".*?"/g); // Styles
        rs.push(/ class=".*?"/g); // Classes
        rs.push(/(&nbsp;){2,}/g); // Redundant &nbsp;s
        rs.push(/<p>(\s|&nbsp;)*?<\/p>/g); // Empty paragraphs
        rs.each(function(regex) {
            s = s.replace(regex, '');
        });
        // Collapse any remaining runs of whitespace into single spaces
        return s.replace(/\s+/g, ' ');
    }
});
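The function flattens line breaks before any of the patterns run. A likely reason (my reading, not something stated above) is that '.' in a JavaScript regular expression does not match a newline by default, so a comment spread over several lines would slip past the comment pattern. A small sketch, using a made-up input string:

```javascript
// The comment pattern from the function above.
var comment = /<!--.+?-->/g;

// Hypothetical input: a comment that spans two lines.
var multiLine = 'a<!-- one\ntwo -->b';

// Without flattening, '.' stops at the newline and nothing is removed.
var untouched = multiLine.replace(comment, '');

// With the same flattening step as the function, the comment is removed.
var flattened = multiLine
    .replace(/\r/g, '\n')
    .replace(/\n/g, ' ')
    .replace(comment, '');
```

Here untouched is identical to the input, while flattened is 'ab'.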
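If you’re not using MooTools, the function will look something like this (String.implement and the array each method both come from MooTools, so inside the body swap rs.each for rs.forEach or a plain loop):
String.prototype.sanitiseWord = function() {
    // function body here...
};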
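For reference, here is a rough sketch of what a MooTools-free version might look like. It is written as a standalone function rather than a prototype method, which is my choice and not part of the original. It assumes that the "redundant" and "empty paragraph" patterns are meant to match &nbsp; entities, and it uses \/? for the optional closing slash so that tags such as thead and tbody are left alone.

```javascript
// Sketch only: the standalone form, the array literal and forEach are
// substitutions for the MooTools-specific parts of the original.
function sanitiseWord(input) {
    // Flatten line breaks so '.' can span what were separate lines.
    var s = String(input).replace(/\r/g, '\n').replace(/\n/g, ' ');
    var rs = [
        /<!--.+?-->/g,               // Comments
        /<title>.+?<\/title>/g,      // Title
        /<(meta|link|\/?o:|\/?style|\/?div|\/?head|\/?html|\/?body|\/?span|!\[)[^>]*?>/g, // Unnecessary tags
        / v:.*?=".*?"/g,             // v: attributes
        / style=".*?"/g,             // Styles
        / class=".*?"/g,             // Classes
        /(&nbsp;){2,}/g,             // Redundant &nbsp;s (assumed)
        /<p>(\s|&nbsp;)*?<\/p>/g     // Empty paragraphs (assumed)
    ];
    rs.forEach(function (regex) {
        s = s.replace(regex, '');
    });
    // Collapse remaining runs of whitespace into single spaces.
    return s.replace(/\s+/g, ' ');
}

// Hypothetical Word-style fragment, invented for this example.
var dirty = '<p class="MsoNormal" style="margin:0">Hello ' +
    '<span lang="EN-GB">world</span></p><!-- junk --><p>&nbsp;</p>';
var clean = sanitiseWord(dirty); // '<p>Hello world</p>'
```

If you prefer the method form, String.prototype.sanitiseWord = function() { return sanitiseWord(this); }; should behave the same way.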
Usage
var s = "(some awful Word markup)".sanitiseWord();
In one of the tests I ran, the input went from around 7000 characters to just 700.
Some of the regular expressions I used were adapted from C# ones in a post by Jeff Atwood.