While building WYSIWYG editor fields for CMS engines, I’ve often run into clients pasting in content from Microsoft Word, which brings along all kinds of unwanted formatting. That formatting either carries over the ugliness of the original document or completely breaks the web layout and semantic correctness.

I’ve come up with this function to remove the extra formatting from HTML produced by WYSIWYG editors such as TinyMCE.

	/**
	 * Remove unwanted HTML tags, including invisible content such as style
	 * and script code, and embedded objects. Removed blocks are replaced
	 * with a space to prevent word joining after tag removal.
	 */
	function strip_html_tags( $text )
	{
		// Word sometimes pastes its comments entity-encoded; decode the
		// delimiters first so the comment pattern below can remove them.
		$text = str_replace( array( '&lt;!--', '--&gt;' ), array( '<!--', '-->' ), $text );

		// Note: the "u" modifier expects valid UTF-8 input.
		$text = preg_replace(
			array(
				// Remove invisible content (each block becomes a space)
				'@<head[^>]*?>.*?</head>@siu',
				'@<style[^>]*?>.*?</style>@siu',
				'@<script[^>]*?>.*?</script>@siu',
				'@<object[^>]*?>.*?</object>@siu',
				'@<embed[^>]*?>.*?</embed>@siu',
				'@<applet[^>]*?>.*?</applet>@siu',
				'@<noframes[^>]*?>.*?</noframes>@siu',
				'@<noscript[^>]*?>.*?</noscript>@siu',
				'@<noembed[^>]*?>.*?</noembed>@siu',
				// Remove Word (Mso*) classes, inline styles, and comments.
				// Each match stops at the attribute's own closing quote.
				'/\sclass\s*=\s*"[^"]*mso[^"]*"/iu',
				'/\sstyle\s*=\s*"[^"]*"/iu',
				'/<!--.*?-->/su',
			),
			array(
				// One replacement per pattern above
				' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '', '', ''
			),
			$text );

		// Drop any stray style tags left behind
		$text = str_replace( array( '<style>', '</style>' ), '', $text );

		return strip_tags( $text, '<address><blockquote><del><div><h1><h2><h3><h4><h5><h6><ins><p><a><b><i><u><img><pre><dl><dt><dd><li><ol><ul><table><tr><th><td><caption><abbr><acronym><span><strong><em>' );
	} // end strip_html_tags
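
One thing worth checking when stripping attributes with a regex is where the match stops. Here is a small sketch of the difference between a greedy pattern and one that ends at the attribute's own closing quote. The sample string and variable names are made up for illustration; only `preg_replace` itself is real.

```php
<?php
// Hypothetical sample of what a Word paste might look like after TinyMCE.
$html = '<p class="MsoNormal" style="margin:0">Hello <a href="http://example.com/">link</a></p>';

// Greedy: (.*) runs to the LAST double quote on the line, so the text
// and the link's href are swallowed along with the style attribute.
$greedy = preg_replace( '/style=(.*)"/', '', $html );

// Scoped: [^"]* cannot cross a quote, so only the style attribute goes.
$scoped = preg_replace( '/\sstyle\s*=\s*"[^"]*"/i', '', $html );

echo $greedy, "\n"; // <p class="MsoNormal" >link</a></p>
echo $scoped, "\n"; // <p class="MsoNormal">Hello <a href="http://example.com/">link</a></p>
```

The scoped form assumes double-quoted attribute values, which is what I have seen come out of the editor; single-quoted or unquoted attributes would need their own patterns.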

Do you guys have any ideas?

P.S. I had a hell of a time even trying to paste this into WordPress. I guess something might need to be done there, too.

By Lilithe

Dork.