jsoup Changelog

What's new in jsoup 1.8.2

Apr 16, 2015
  • Performance improvements for parsing HTML on Android, of 1.5x to 1.9x, with larger parses getting a bigger speed increase. For non-Android JREs, around 1.1x to 1.2x.
  • Dramatic performance improvement in HTML serialization on Android (KitKat and later), of 115x. Achieved by working around a character set encoding speed regression in Android.
  • Performance improvement for the class name selector on Android (.class) of 2.5x to 14x. Around 1.2x on non-Android JREs.
  • File upload support. Added the ability to specify input streams for POST data, which will upload content in MIME multipart/form-data encoding.
  • Add a meta-charset element to documents when setting the character set, so that the document's charset is unambiguous.
  • Added ability to disable TLS (SSL) certificate validation. Helpful if you're hitting a host with a bad cert, or your JDK doesn't support SNI.
  • Added ability to further tweak the canned Cleaner Whitelists by removing existing settings.
  • Added option in Cleaner Whitelist to allow linking to in-page anchors (#)
  • Use a lowercase doctype tag for HTML5 documents.
  • Added support for 201 Created with redirect, and other status codes. Any HTTP status code 2xx or 3xx is now treated as an OK response, and redirects are followed whenever there is a Location header.
  • Added support for HTTP method verbs PUT, DELETE, and PATCH.
  • Added support for overriding the default POST character set of UTF-8.
  • W3C DOM support: added ability to convert from a jsoup document to a W3C document, with the W3Dom helper class.
  • In the HtmlToPlainText example program, added the ability to filter using a CSS selector. Also clarified the usage documentation.
  • Fixed validation of cookie names in HttpConnection cookie methods.
  • Fixed an issue where select tags would be missed when preparing a form for submission if none of their options had a selected attribute.
  • Fixed an issue where submitting a form would incorrectly include radio and checkbox values without the checked attribute.
  • Fixed an issue where Element.classNames() could return a set containing an empty class name, or class names with extraneous whitespace.
  • Fixed an issue where attributes selected by value were not correctly space normalized.
  • In head+noscript elements, treat content as character data, instead of jumping out of head parsing.
  • Fixed performance issue when parsing HTML with elements with many children that need re-parenting.
  • Fixed an issue where a server returning an unsupported character set in its response would cause a runtime UnsupportedCharsetException, instead of falling back to the default UTF-8 charset.
  • Fixed an issue where Jsoup.Connection would throw an IOException when reading a page with zero content-length.
  • Improved the equals() and hashCode() methods in Node, to consider all their child content, for DOM tree comparisons.
  • Improved performance in Selector when searching multiple roots.
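
The charset fallback fix listed above can be illustrated with a small standalone sketch. This is not jsoup's code: the class and method names (CharsetFallback, resolve) are hypothetical, and it assumes that a missing, malformed, or unsupported charset name should all resolve to UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of jsoup).
final class CharsetFallback {

    /** Resolves a charset name declared by a server, defaulting to UTF-8. */
    static Charset resolve(String declaredName) {
        if (declaredName == null || declaredName.trim().isEmpty()) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(declaredName.trim());
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException and UnsupportedCharsetException
            // both extend IllegalArgumentException.
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("ISO-8859-1"));
        System.out.println(resolve("x-made-up-charset"));
    }
}
```

Catching the common superclass keeps the fallback in one place, so the caller never sees a runtime exception for a bad charset header.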

New in jsoup 1.8.1 (Sep 29, 2014)

  • Introduced the ability to choose between HTML and XML output, and made HTML the default. This means img tags are output as <img>, not <img />. XML is the default when using the XmlTreeBuilder. Control this with the Document.OutputSettings.syntax() method.
  • Improved the performance of Element.text() by 3.2x
  • Improved the performance of Element.html() by 1.7x
  • Improved file read time by 2x, giving around a 10% speed improvement to file parses.
  • Tightened the scope of what characters are escaped in attributes and textnodes, to align with the spec. Also, when using the extended escape entities map, only escape a character if the current output charset does not support it. This produces smaller, more legible HTML, with greater control over the output (by setting charset and escape mode).
  • If pretty-print is disabled, don't trim outer whitespace in Element.html()
  • In the HTML Cleaner, allow span tags in the basic whitelist, and span and div tags in the relaxed whitelist.
  • Added Element.cssSelector(), which returns a unique CSS selector/path for an element.
  • Fixed an issue where was parsed as
  • Fixed an issue where a UTF-8 BOM character was not detected if the HTTP response did not specify a charset, and the HTML body did, leading to the head contents incorrectly being parsed into the body. Changed the behavior so that when the UTF-8 BOM is detected, it will take precedence for determining the charset to decode with.
  • Relaxed doctype validation, allowing doctypes to not specify a name.
  • Fixed an issue in parsing a base URI when loading a URL containing a http-equiv element.
  • Fixed an issue for Java 1.5 / Android 2.2 compatibility, and verify it doesn't regress.
  • Fixed an issue that would throw an NPE when trying to set invalid HTML into a title element.
  • Added support for quoted attribute values in CSS Selectors
  • Fixed support for nth-of-type selectors with unknown tags.
  • Added support for 'application/*+xml' mimetypes.
  • Fixed support for allowing script tags in cleaner whitelists.
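
The escaping change listed above (only escape a character when the output charset cannot represent it) can be sketched with the standard CharsetEncoder API. This is a simplified illustration, not jsoup's implementation: CharsetAwareEscaper and escapeText are hypothetical names, and the sketch always emits hexadecimal numeric references rather than consulting a named-entity map.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of jsoup).
final class CharsetAwareEscaper {

    /** Escapes markup characters, plus any character the output charset cannot encode. */
    static String escapeText(String text, Charset outputCharset) {
        CharsetEncoder encoder = outputCharset.newEncoder();
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            switch (codePoint) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:
                    if (encoder.canEncode(new String(Character.toChars(codePoint)))) {
                        out.appendCodePoint(codePoint); // representable: keep as-is
                    } else {
                        // not representable: fall back to a numeric character reference
                        out.append("&#x").append(Integer.toHexString(codePoint)).append(';');
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "caf\u00e9 <b>";
        System.out.println(escapeText(input, StandardCharsets.US_ASCII));
        System.out.println(escapeText(input, StandardCharsets.UTF_8));
    }
}
```

With an ASCII output charset the accented character becomes a numeric reference; with UTF-8 it is written unchanged, which is what keeps the output smaller and more legible.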

New in jsoup 1.7.3 (Sep 29, 2014)

  • Introduced FormElement, providing easy access to form controls and their data, and the ability to submit forms with Jsoup.Connect.
  • Reduced GC impact during HTML parsing, with 17% fewer objects created, and 3% faster parses.
  • Reduced CSS selection time by 26% for common queries.
  • Improved HTTP character set detection.
  • Added Document.location, to get the URL the document was retrieved from. Helpful if connection was redirected.
  • Fixed support for self-closing script tags.
  • Fixed a crash when reading an unterminated CDATA section.
  • Fixed an issue where elements added via the adoption agency algorithm did not preserve their attributes.
  • Fixed an issue when cloning a document with extremely nested elements that could cause a stack-overflow.
  • Fixed an issue when connecting or redirecting to a URL that contains a space.
  • Added support for the HTTP/1.1 Temporary Redirect (307) status code.

New in jsoup 1.7.2 (Apr 11, 2013)

  • Added support for supplementary characters outside of the Basic Multilingual Plane.
  • Added support for structural pseudo CSS selectors, including :first-child, :last-child, :nth-child, :nth-last-child, :first-of-type, :last-of-type, :nth-of-type, :nth-last-of-type, :only-child, :only-of-type, :empty, and :root
  • Added a maximum body response size to Jsoup.Connection, to prevent running out of memory when trying to read extremely large documents. The default is 1MB.
  • Refactored the Cleaner to traverse rather than recurse child nodes, to avoid the risk of overflowing the stack.
  • Added Element.insertChildren(), to easily insert a list of child nodes at a specific index.
  • Added Node.childNodesCopy(), to create an independent copy of a Node's children.
  • When parsing in XML mode, preserve XML declarations (<?xml ... ?>).
  • Introduced Parser.parseXmlFragment(), to allow easy parsing of XML fragments.
  • Allow Whitelist test methods to be extended
  • Added Document.OutputSettings.outline mode, to aid HTML debugging by printing out in outline mode, similar to browser HTML inspectors.
  • When parsing, allow all tags to self-close. Tags that aren't expected to self-close will get an end tag.
  • Fixed an issue when parsing /RCData tags containing unescaped closing tags that would drop the trailing >.
  • Corrected the javadoc for Element#child() to note that it throws IndexOutOfBounds.
  • When cloning an Element, reset the classnames set so as not to hold a pointer to the source's.
  • Limit how far up the stack the formatting adoption agency algorithm will travel, to prevent the chance of a run-away parse when the HTML stack is hopelessly deep.
  • Modified Element.text() to build text by traversing child nodes rather than recursing. This avoids stack-overflow errors when the DOM is very deep and the VM stack-size is low.
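
Several entries above replace recursion with traversal to avoid stack overflows on very deep trees. The general technique can be sketched with an explicit stack. The Node type here is a hypothetical stand-in (not jsoup's Node), and the sketch concatenates text without any of the whitespace handling a real Element.text() would need.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical tree type and helper, for illustration only (not part of jsoup).
final class IterativeText {

    static final class Node {
        final String text; // null for element-like nodes, non-null for text nodes
        final List<Node> children = new ArrayList<>();

        Node(String text) { this.text = text; }

        Node add(Node child) { children.add(child); return this; }
    }

    /** Collects text in document order using an explicit stack instead of recursion. */
    static String text(Node root) {
        StringBuilder sb = new StringBuilder();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            if (node.text != null) {
                sb.append(node.text);
            }
            // Push children in reverse so the first child is visited first.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                stack.push(node.children.get(i));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Node root = new Node(null)
            .add(new Node("a"))
            .add(new Node(null).add(new Node("b")).add(new Node("c")))
            .add(new Node("d"));
        System.out.println(text(root));
    }
}
```

Because the pending nodes live on the heap rather than the call stack, the depth of the tree is limited by available memory instead of the VM stack size.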

New in jsoup 1.7.1 (Dec 19, 2012)

  • Added a maximum body response size to Jsoup.Connection, to prevent running out of memory when trying to read extremely large documents.
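
The idea of a maximum body size can be sketched as a capped stream read. This is an illustrative sketch, not jsoup's implementation: CappedReader and readCapped are hypothetical names, and stopping silently at the cap (rather than throwing) is an assumption the changelog does not confirm.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, for illustration only (not part of jsoup).
final class CappedReader {

    /** Reads at most maxBytes from the stream, then stops. */
    static byte[] readCapped(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int remaining = maxBytes;
        while (remaining > 0) {
            int read = in.read(buffer, 0, Math.min(buffer.length, remaining));
            if (read == -1) {
                break; // end of stream reached before the cap
            }
            out.write(buffer, 0, read);
            remaining -= read;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] body = readCapped(new ByteArrayInputStream(new byte[5000]), 1000);
        System.out.println(body.length);
    }
}
```

Bounding the read up front means memory use is limited by the cap, regardless of what content length the server claims.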

New in jsoup 1.6.2 (Mar 28, 2012)

  • Added a simplified XML parsing mode, which can usefully parse valid and invalid XML, but does not enforce any HTML document structure or special tag behaviour.
  • Added the optional ability to track errors when tokenising and parsing.
  • Added jsoup.connect.cookies(Map) method, to set multiple cookies at once, possibly from a prior request.
  • Added Element.textNodes() and Element.dataNodes(), to easily access an element's children text nodes and data nodes.
  • Added an example program that demonstrates how to format HTML as plain-text, and the use of the NodeVisitor interface.
  • Added Node.traverse() and Elements.traverse() methods, to iterate through a node's descendants.
  • Updated jsoup.connect so that when requests made as POSTs are redirected, the redirect is followed as a GET.
  • Updated the Cleaner and whitelists to optionally preserve relative links in elements, instead of converting them to absolute links.
  • Updated the Cleaner to support custom allowed protocols such as "cid:" and "data:".
  • Updated handling of tags, to act on only the first one seen when parsing, to align with modern browsers.
  • Updated Node.setBaseUri(), to recursively set on all the node's descendants.
  • Fixed handling of null characters within comments.
  • Tweaked escaped entity detection in attributes to not treat &entity_... as an entity form.
  • Fixed doctype tokeniser to allow whitespace between name and public identifier.
  • Fixed issue where comments within a table tag would be duplicate-fostered into body.
  • Fixed an issue where a spurious byte-order-mark at the start of a document would cause the parser to miss head contents.
  • Fixed an issue where content after a frameset could cause a NPE crash. Now correctly implements spec and ignores the trailing content.
  • Tweaked whitespace checks to align with HTML spec
  • Tweaked HTML output of closing script and style tags to not add an extraneous newline when pretty-printing.
  • Substantially reduced default memory allocation within Node.outerHtml, to reduce memory pressure when serialising smaller DOMs.
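
The byte-order-mark fix above concerns a stray U+FEFF at the start of decoded input. One common way to handle it, shown here as a hedged sketch with hypothetical names (BomStripper, stripLeadingBom) rather than jsoup's code, is to drop the mark before handing the text to the parser.

```java
// Hypothetical helper, for illustration only (not part of jsoup).
final class BomStripper {

    private static final char BOM = '\uFEFF';

    /** Removes a single leading byte-order-mark from already-decoded text. */
    static String stripLeadingBom(String decoded) {
        if (decoded != null && !decoded.isEmpty() && decoded.charAt(0) == BOM) {
            return decoded.substring(1);
        }
        return decoded;
    }

    public static void main(String[] args) {
        String html = "\uFEFF<html><head><title>t</title></head></html>";
        System.out.println(stripLeadingBom(html));
    }
}
```

Without this step, a parser may treat the mark as character data, which opens the body before the head has been read.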

New in jsoup 1.2.3 (Aug 5, 2010)

  • HTML5 support:
      • While jsoup has always included implicit support for HTML5 tags, this release introduces explicit tag definitions. This ensures that when out-of-spec HTML5 is found (e.g. badly nested, or incorrectly parented), jsoup will create an in-spec parse tree.
      • HTML5 Datasets are now supported with the Element.dataset() method that provides a convenient map view of an element's dataset.
  • Improved international support:
      • When parsing HTML from a file or a URL, jsoup will now automatically detect the document's character set, and decode the input appropriately before parsing.
      • You can also define the document's output character set with the Document.outputSettings().charset(String) method. This controls which characters will be HTML escaped on output, and which will be kept as-is. The output charset defaults to the input charset.
  • Other improvements and bug fixes:
      • I've added two new selectors:
          • namespace|element finds elements by tagname in a namespace
          • [^attributePrefix] finds elements that have an attribute name starting with a prefix
      • Also:
          • Added support for namespaced elements (<fb:name>) and selectors to find them (fb|name)
          • Implemented the Node.ownerDocument() DOM method
          • Improved implicit table element handling (particularly around thead, tbody, and tfoot).
          • Improved HTML output format for empty elements and auto-detected self closing tags
          • Changed DT & DD tags to block-mode tags, to follow practice over spec
          • Added support for tag names with - and _
          • Handle tags with internal trailing space
          • Fixed support for character class regular expressions in the [attr~=regex] selector
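
The dataset feature above maps an element's data-* attributes to keys without the prefix. The mapping itself can be sketched over a plain attribute map. DatasetSketch is a hypothetical name, and unlike the map view described in the changelog, this sketch returns an independent copy.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only (not part of jsoup).
final class DatasetSketch {

    private static final String PREFIX = "data-";

    /** Returns the data-* attributes keyed by their name without the prefix. */
    static Map<String, String> dataset(Map<String, String> attributes) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            String key = entry.getKey();
            if (key.startsWith(PREFIX) && key.length() > PREFIX.length()) {
                result.put(key.substring(PREFIX.length()), entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("id", "profile");
        attributes.put("data-user", "42");
        attributes.put("data-role", "admin");
        System.out.println(dataset(attributes));
    }
}
```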

New in jsoup 1.2.1 (Jul 1, 2010)

  • Added .before(html) and .after(html) methods to Element and Elements, to insert sibling HTML
  • Added :contains(text) selector, to search for elements containing the specified text
  • Added :has(selector) pseudo-selector
  • Added Element#parents and Elements#parents to retrieve an element's ancestor chain
  • Fixes an issue where appending / prepending rows to a table (or to similar implicit element structures) would create redundant wrapping elements
  • Improved implicit close tag heuristic detection when parsing malformed HTML
  • Fixes an issue where text content after a script (or other data-node) was incorrectly added to the data node.
  • Fixes an issue where text order was incorrect when parsing pre-document HTML.
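
Retrieving an element's ancestor chain, as the parents methods above do, amounts to walking parent links up to the root. A minimal sketch follows, using a hypothetical Node type rather than jsoup's Element. Returning the closest ancestor first is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tree type and helper, for illustration only (not part of jsoup).
final class Ancestors {

    static final class Node {
        final String name;
        final Node parent; // null for the root

        Node(String name, Node parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    /** Returns the ancestors of a node, closest first, ending at the root. */
    static List<Node> parents(Node node) {
        List<Node> chain = new ArrayList<>();
        for (Node p = node.parent; p != null; p = p.parent) {
            chain.add(p);
        }
        return chain;
    }

    public static void main(String[] args) {
        Node html = new Node("html", null);
        Node body = new Node("body", html);
        Node p = new Node("p", body);
        for (Node ancestor : parents(p)) {
            System.out.println(ancestor.name);
        }
    }
}
```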