net.sourceforge.htmlcleaner » htmlcleaner
HtmlCleaner is an HTML parser written in Java. It transforms malformed ("dirty") HTML into well-formed XML, following the same rules that most web browsers use.