What's new in jsoup 1.8.2
Apr 16, 2015
- Performance improvements for parsing HTML on Android, of 1.5x to 1.9x, with larger parses getting a bigger speed increase. For non-Android JREs, around 1.1x to 1.2x.
- Dramatic performance improvement in HTML serialization on Android (KitKat and later), of 115x. Achieved by working around a character set encoding speed regression in Android.
- Performance improvement for the class name selector on Android (.class) of 2.5x to 14x. Around 1.2x on non-Android JREs.
- File upload support. Added the ability to specify input streams for POST data, which will upload content in MIME multipart/form-data encoding.
- Add a meta-charset element to documents when setting the character set, so that the document's charset is unambiguous.
- Added ability to disable TLS (SSL) certificate validation. Helpful if you're hitting a host with a bad cert, or your JDK doesn't support SNI.
- Added ability to further tweak the canned Cleaner Whitelists by removing existing settings.
- Added option in Cleaner Whitelist to allow linking to in-page anchors (#)
- Use a lowercase doctype tag for HTML5 documents.
- Added support for 201 Created with redirect, and other status codes. Treats any HTTP status code 2xx or 3xx as an OK response, and follows redirects whenever there is a Location header.
- Added support for HTTP method verbs PUT, DELETE, and PATCH.
- Added support for overriding the default POST character set of UTF-8.
- W3C DOM support: added ability to convert from a jsoup document to a W3C document, with the W3CDom helper class.
- In the HtmlToPlainText example program, added the ability to filter using a CSS selector. Also clarified the usage documentation.
- Fixed validation of cookie names in HttpConnection cookie methods.
- Fixed an issue where option tags would be missed when preparing a form for submission if they were missing a selected attribute.
- Fixed an issue where submitting a form would incorrectly include radio and checkbox values without the checked attribute.
- Fixed an issue where Element.classNames() would return a set containing an empty class name, and where class names could carry extraneous whitespace.
- Fixed an issue where attributes selected by value were not correctly space normalized.
- In head+noscript elements, treat content as character data, instead of jumping out of head parsing.
- Fixed performance issue when parsing HTML with elements with many children that need re-parenting.
- Fixed an issue where a server returning an unsupported character set response would cause a runtime UnsupportedCharsetException, instead of falling back to the default UTF-8 charset.
- Fixed an issue where Jsoup.Connection would throw an IOException when reading a page with zero content-length.
- Improved the equals() and hashCode() methods in Node, to consider all their child content, for DOM tree comparisons.
- Improved performance in Selector when searching multiple roots.
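The unsupported character set fix listed above comes down to checking a declared charset name before using it. A minimal sketch of that fallback idea, using only the JDK; the class and method names (CharsetFallback, resolve) are hypothetical and this is not jsoup's own code:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of jsoup.
final class CharsetFallback {
    private CharsetFallback() {}

    // Returns the declared charset if this JVM supports it, otherwise UTF-8.
    static Charset resolve(String declaredName) {
        if (declaredName == null) {
            return StandardCharsets.UTF_8;
        }
        String name = declaredName.trim();
        try {
            if (Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal names (for example, names
            // containing spaces); treated the same as an unknown charset.
        }
        return StandardCharsets.UTF_8;
    }
}
```

A caller would pass the charset name taken from the Content-Type header or a meta tag, and decode the response body with whatever resolve() returns.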
New in jsoup 1.8.1 (Sep 29, 2014)
- Introduced the ability to choose between HTML and XML output, and made HTML the default. This means img tags are output as <img>, not <img />. XML is the default when using the XmlTreeBuilder. Control this with the Document.OutputSettings.syntax() method.
- Improved the performance of Element.text() by 3.2x
- Improved the performance of Element.html() by 1.7x
- Improved file read time by 2x, giving around a 10% speed improvement to file parses.
- Tightened the scope of what characters are escaped in attributes and textnodes, to align with the spec. Also, when using the extended escape entities map, only escape a character if the current output charset does not support it. This produces smaller, more legible HTML, with greater control over the output (by setting charset and escape mode).
- If pretty-print is disabled, don't trim outer whitespace in Element.html()
- In the HTML Cleaner, allow span tags in the basic whitelist, and span and div tags in the relaxed whitelist.
- Added Element.cssSelector(), which returns a unique CSS selector/path for an element.
- Fixed an issue where was parsed as
- Fixed an issue where a UTF-8 BOM character was not detected if the HTTP response did not specify a charset, and the HTML body did, leading to the head contents incorrectly being parsed into the body. Changed the behavior so that when the UTF-8 BOM is detected, it will take precedence for determining the charset to decode with.
- Relaxed doctype validation, allowing doctypes to not specify a name.
- Fixed an issue in parsing a base URI when loading a URL containing a http-equiv element.
- Fixed an issue for Java 1.5 / Android 2.2 compatibility, and added a check to verify it doesn't regress.
- Fixed an issue that would throw an NPE when trying to set invalid HTML into a title element.
- Added support for quoted attribute values in CSS Selectors
- Fixed support for nth-of-type selectors with unknown tags.
- Added support for 'application/*+xml' mimetypes.
- Fixed support for allowing script tags in cleaner whitelists.
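The escaping change in this release (only escape a character if the current output charset does not support it) can be sketched with the JDK's CharsetEncoder. The class below is a hypothetical illustration, not jsoup's serializer: the handling of &, < and > is simplified, and unsupported characters are written as hexadecimal numeric references rather than named entities.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, for illustration only; not part of jsoup.
final class CharsetAwareEscaper {
    private CharsetAwareEscaper() {}

    // Escapes markup characters, and any character the output charset cannot encode.
    static String escapeText(String text, Charset outputCharset) {
        CharsetEncoder encoder = outputCharset.newEncoder();
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            int length = Character.charCount(codePoint);
            String chars = text.substring(i, i + length);
            i += length;
            if (codePoint == '&') {
                out.append("&amp;");
            } else if (codePoint == '<') {
                out.append("&lt;");
            } else if (codePoint == '>') {
                out.append("&gt;");
            } else if (encoder.canEncode(chars)) {
                // The output charset can represent it: keep the character as-is.
                out.append(chars);
            } else {
                // Otherwise fall back to a numeric character reference.
                out.append("&#x").append(Integer.toHexString(codePoint)).append(';');
            }
        }
        return out.toString();
    }
}
```

With a UTF-8 output charset, accented and other non-ASCII characters pass through untouched; with US-ASCII, the same characters are written as numeric references, which is why the choice of output charset changes the size and legibility of the result.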
New in jsoup 1.7.3 (Sep 29, 2014)
- Introduced FormElement, providing easy access to form controls and their data, and the ability to submit forms with Jsoup.Connect.
- Reduced GC impact during HTML parsing, with 17% fewer objects created, and 3% faster parses.
- Reduced CSS selection time by 26% for common queries.
- Improved HTTP character set detection.
- Added Document.location, to get the URL the document was retrieved from. Helpful if connection was redirected.
- Fixed support for self-closing script tags.
- Fixed a crash when reading an unterminated CDATA section.
- Fixed an issue where elements added via the adoption agency algorithm did not preserve their attributes.
- Fixed an issue when cloning a document with extremely nested elements that could cause a stack-overflow.
- Fixed an issue when connecting or redirecting to a URL that contains a space.
- Added support for the HTTP/1.1 Temporary Redirect (307) status code.
New in jsoup 1.7.2 (Apr 11, 2013)
- Added support for supplementary characters outside of the Basic Multilingual Plane.
- Added support for structural pseudo CSS selectors, including :first-child, :last-child, :nth-child, :nth-last-child, :first-of-type, :last-of-type, :nth-of-type, :nth-last-of-type, :only-child, :only-of-type, :empty, and :root.
- Added a maximum body response size to Jsoup.Connection, to prevent running out of memory when trying to read extremely large documents. The default is 1MB.
- Refactored the Cleaner to traverse rather than recurse child nodes, to avoid the risk of overflowing the stack.
- Added Element.insertChildren(), to easily insert a list of child nodes at a specific index.
- Added Node.childNodesCopy(), to create an independent copy of a Node's children.
- When parsing in XML mode, preserve XML declarations.
- Introduced Parser.parseXmlFragment(), to allow easy parsing of XML fragments.
- Allow Whitelist test methods to be extended
- Added Document.OutputSettings.outline mode, to aid HTML debugging by printing out in outline mode, similar to browser HTML inspectors.
- When parsing, allow all tags to self-close. Tags that aren't expected to self-close will get an end tag.
- Fixed an issue when parsing /RCData tags containing unescaped closing tags that would drop the trailing >.
- Corrected the javadoc for Element#child() to note that it throws IndexOutOfBounds.
- When cloning an Element, reset the classnames set so as not to hold a pointer to the source's.
- Limit how far up the stack the formatting adoption agency algorithm will travel, to prevent the chance of a run-away parse when the HTML stack is hopelessly deep.
- Modified Element.text() to build text by traversing child nodes rather than recursing. This avoids stack-overflow errors when the DOM is very deep and the VM stack-size is low.
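Several entries in this release replace recursion with traversal so that very deep trees cannot overflow the VM stack. A minimal sketch of the idea, using an explicit stack on the heap; SimpleNode is a hypothetical stand-in for a DOM node, and this is not how jsoup itself implements Element.text() or the Cleaner:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for a DOM node, for illustration only.
final class SimpleNode {
    final String text; // this node's own text; empty for pure containers
    final List<SimpleNode> children = new ArrayList<>();

    SimpleNode(String text) {
        this.text = text;
    }

    SimpleNode add(SimpleNode child) {
        children.add(child);
        return this;
    }

    // Collects text in document order without recursion, so the depth of the
    // tree is limited by heap size rather than by the thread's stack size.
    String collectText() {
        StringBuilder sb = new StringBuilder();
        Deque<SimpleNode> pending = new ArrayDeque<>();
        pending.push(this);
        while (!pending.isEmpty()) {
            SimpleNode node = pending.pop();
            sb.append(node.text);
            // Push children in reverse so the first child is popped first.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                pending.push(node.children.get(i));
            }
        }
        return sb.toString();
    }
}
```

A recursive version of collectText() would use one stack frame per nesting level, which is what fails when the DOM is very deep and the VM stack size is low.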
New in jsoup 1.7.1 (Dec 19, 2012)
- Added a maximum body response size to Jsoup.Connection, to prevent running out of memory when trying to read extremely large documents.
New in jsoup 1.6.2 (Mar 28, 2012)
- Added a simplified XML parsing mode, which can usefully parse valid and invalid XML, but does not enforce any HTML document structure or special tag behaviour.
- Added the optional ability to track errors when tokenising and parsing.
- Added jsoup.connect.cookies(Map) method, to set multiple cookies at once, possibly from a prior request.
- Added Element.textNodes() and Element.dataNodes(), to easily access an element's child text nodes and data nodes.
- Added an example program that demonstrates how to format HTML as plain-text, and the use of the NodeVisitor interface.
- Added Node.traverse() and Elements.traverse() methods, to iterate through a node's descendants.
- Updated jsoup.connect so that when requests made as POSTs are redirected, the redirect is followed as a GET.
- Updated the Cleaner and whitelists to optionally preserve relative links in elements, instead of converting them to absolute links.
- Updated the Cleaner to support custom allowed protocols such as "cid:" and "data:".
- Updated handling of tags, to act on only the first one seen when parsing, to align with modern browsers.
- Updated Node.setBaseUri(), to recursively set on all the node's descendants.
- Fixed handling of null characters within comments.
- Tweaked escaped entity detection in attributes to not treat &entity_... as an entity form.
- Fixed doctype tokeniser to allow whitespace between name and public identifier.
- Fixed issue where comments within a table tag would be duplicate-fostered into body.
- Fixed an issue where a spurious byte-order-mark at the start of a document would cause the parser to miss head contents.
- Fixed an issue where content after a frameset could cause a NPE crash. Now correctly implements spec and ignores the trailing content.
- Tweaked whitespace checks to align with HTML spec
- Tweaked HTML output of closing script and style tags to not add an extraneous newline when pretty-printing.
- Substantially reduced default memory allocation within Node.outerHtml, to reduce memory pressure when serialising smaller DOMs.
New in jsoup 1.2.3 (Aug 5, 2010)
- HTML5 support:
  - While jsoup has always included implicit support for HTML5 tags, this release introduces explicit tag definitions. This ensures that when out-of-spec HTML5 is found (e.g. badly nested, or incorrectly parented), jsoup will create an in-spec parse tree.
  - HTML5 Datasets are now supported with the Element.dataset() method that provides a convenient map view of an element's dataset.
- Improved international support:
  - When parsing HTML from a file or a URL, jsoup will now automatically detect the document's character set, and decode the input appropriately before parsing.
  - You can also define the document's output character set with the Document.outputSettings().charset(String) method. This controls which characters will be HTML escaped on output, and which will be kept as-is. The output charset defaults to the input charset.
- Other improvements and bug fixes:
  - I've added two new selectors:
    - namespace|element finds elements by tagname in a namespace
    - [^attributePrefix] finds elements that have an attribute name starting with a prefix
  - Also:
    - Added support for namespaced elements (<fb:name>) and selectors to find them (fb|name)
    - Implemented the Node.ownerDocument() DOM method
    - Improved implicit table element handling (particularly around thead, tbody, and tfoot).
    - Improved HTML output format for empty elements and auto-detected self closing tags
    - Changed DT & DD tags to block-mode tags, to follow practice over spec
    - Added support for tag names with - and _
    - Handle tags with internal trailing space
    - Fixed support for character class regular expressions in the [attr~=regex] selector
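The dataset support and the [^attributePrefix] selector in this release both work on attribute name prefixes. A rough sketch of that logic over a plain attribute map; AttributeViews and its methods are hypothetical names, and unlike the map view the release notes describe for Element.dataset(), this sketch returns an independent copy:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of jsoup.
final class AttributeViews {
    private static final String DATA_PREFIX = "data-";

    private AttributeViews() {}

    // Returns the data-* attributes, keyed by name with the "data-" prefix removed.
    static Map<String, String> dataset(Map<String, String> attributes) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            String key = entry.getKey();
            if (key.startsWith(DATA_PREFIX) && key.length() > DATA_PREFIX.length()) {
                result.put(key.substring(DATA_PREFIX.length()), entry.getValue());
            }
        }
        return result;
    }

    // Mirrors the idea behind [^attributePrefix]: true if any attribute name
    // starts with the given prefix.
    static boolean hasAttributeWithPrefix(Map<String, String> attributes, String prefix) {
        for (String key : attributes.keySet()) {
            if (key.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

For an element with attributes id, data-user and data-role, dataset() would yield the keys user and role, and hasAttributeWithPrefix(attributes, "data-") would be true.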
New in jsoup 1.2.1 (Jul 1, 2010)
- Added .before(html) and .after(html) methods to Element and Elements, to insert sibling HTML
- Added :contains(text) selector, to search for elements containing the specified text
- Added :has(selector) pseudo-selector
- Added Element#parents and Elements#parents to retrieve an element's ancestor chain
- Fixed an issue where appending / prepending rows to a table (or to similar implicit element structures) would create redundant wrapping elements
- Improved implicit close tag heuristic detection when parsing malformed HTML
- Fixed an issue where text content after a script (or other data-node) was incorrectly added to the data node.
- Fixed an issue where text order was incorrect when parsing pre-document HTML.