HtmlCleaner Changelog

New in HtmlCleaner 2.6.1 (Sep 6, 2013)

  • Fixed Issue 90: reinstated HtmlCleaner's public instance method clean(Reader).

New in HtmlCleaner 2.2 (Jan 24, 2013)

  • HtmlCleaner is now thread-safe. A single instance can be used from multiple threads to parse multiple HTML sources safely. All serializers included in the package are thread-safe as well.
  • HTML-based serializers are introduced, intended to produce browser-friendly HTML. There are now two serializer flavors: XML (simple, pretty, compact) and HTML (simple, pretty, compact). HTML serializers don't strictly produce well-formed XML, but rather HTML for further browser consumption (for example, special entities like &Alpha; are preserved rather than escaped, and empty tags like script are serialized not as <script/> but as <script></script>).
  • New parameter transResCharsToNCR is introduced, telling whether reserved XML characters (&, ", ', <, >) are serialized to their Numeric Character Representations (&#dd;).
  • New parameter transSpecialEntitiesToNCR is introduced, telling whether special HTML entities (&Alpha; for example) are serialized to their Numeric Character Representations (&#dd;).
  • Parameter omitHtmlEnvelope is deprecated. It is replaced by the new parameter omitEnvelope for the command line and Ant, and by an optional parameter in the XXXSerializer.writeToXXX() methods, which moves this logic to the right place. The whole body without enclosing tags is now serialized, not only the first inner node as before.
  • The list of special HTML entities is extended with a number of new ones. Class SpecialEntity, which holds them, has a public method addEntity(entityName, entityCode) for defining new ones if some are still missing.
  • TagNode has a number of new methods for easier node manipulation (see API docs).
  • The Visitor concept is implemented in TagNode to make it easy to traverse the DOM tree and collect data or update the document.
  • Pretty XML/HTML serializers have an optional constructor parameter specifying the indentation string (default is the TAB character).
  • Tag definitions updated (col, legend...) to be consistent with the browsers.
  • Invalid XML characters are skipped during parsing/serialization.
  • DOM/JDom serialization bug fixes.
  • Namespaces found in source HTML are now handled properly (depending on omitXmlnsAttributes parameter).
  • Method HtmlCleaner.getAllTags() is removed, since this approach is incompatible with the newly introduced thread-safety.
  • A few classes are renamed: ContentToken -> ContentNode, CommentToken -> CommentNode.
  • Parameter ignoreQuestAndExclam now has default value true.
  • Source code now has the standard Maven structure.
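
The two NCR parameters above can be illustrated with a small standalone sketch. It does not use HtmlCleaner and is not the library's implementation; the class NcrSketch and its methods toNcr and reservedToNcr are hypothetical names chosen for this example. It only shows what a numeric character reference of the form &#dd; looks like for the five reserved XML characters and for a special entity such as Greek capital Alpha (U+0391), assuming decimal rather than hexadecimal notation.

```java
// Standalone illustration only: NcrSketch is a hypothetical class,
// not part of HtmlCleaner, and assumes decimal NCR notation.
final class NcrSketch {

    /** Builds the decimal numeric character reference for a code point, e.g. 38 -> "&#38;". */
    static String toNcr(int codePoint) {
        return "&#" + codePoint + ";";
    }

    /** Replaces the five reserved XML characters with decimal NCRs; all other characters pass through. */
    static String reservedToNcr(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':
                case '"':
                case '\'':
                case '<':
                case '>':
                    out.append(toNcr(c)); // char widens to its numeric code
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Reserved characters become &#60;, &#38; and &#34; here.
        System.out.println(reservedToNcr("a < b & \"c\""));
        // Greek capital Alpha, U+0391, is code point 913 in decimal.
        System.out.println(toNcr(0x391));
    }
}
```

With this sketch, reservedToNcr("a < b & \"c\"") yields a &#60; b &#38; &#34;c&#34; and toNcr(0x391) yields &#913;.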

New in HtmlCleaner 2.1 (Jul 28, 2009)

  • Parsing transformations are introduced to make it easy to skip or change specified tags or attributes during the cleanup process.
  • A few more constructors are added to class HtmlCleaner, making it possible to reuse the same cleaner properties across multiple cleaner instances.
  • Code cleanup.