Norconex HTTP Collector Changelog

What's new in Norconex HTTP Collector 2.6.0

Oct 26, 2016

New normalization rule for GenericURLNormalizer: removeTrailingSlash.
Can now specify "notFoundStatusCodes" on GenericMetadataFetcher.
Specifying an empty "path" tag in XML config, or setting a null or empty string array with StandardSitemapResolverFactory#setSitemapPaths(...), now prevents the collector from looking for sitemaps in default locations. It then relies strictly on sitemap URLs specified as start URLs or found in robots.txt (if enabled).
GenericLinkExtractor no longer extracts URLs from HTML/XML comments by default. To re-enable this behavior, a new "setCommentsEnabled(boolean)" method has been added.
Normalization rule "addTrailingSlash" in GenericURLNormalizer has been renamed "addDirectoryTrailingSlash".
SitemapStore now uses MVStore instead of MapDB for storing the list of processed sitemaps.
Referrer data is now always stored for GenericLinkExtractor (default) and TikaLinkExtractor.
Redirects encountered by IHttpMetadataFetcher implementations are now followed by default.
Dependency updates: Norconex Collector Core 1.6.0.
Renamed constant HttpMetadata.COLLECTOR_REFERNCED_URLS to HttpMetadata.COLLECTOR_REFERENCED_URLS (was misspelled).
API break: the IHttpMetadataFetcher method signature changed from "Properties fetchHTTPHeaders(HttpClient httpClient, String url)" to "HttpFetchResponse fetchHTTPHeaders(HttpClient httpClient, String url, Properties headers)".
The subject for the crawler event HttpCrawlerEvent.DOCUMENT_METADATA_FETCHED is now an instance of HttpFetchResponse.
Extracted canonical URLs now have their referrer reference stored with them.
Extracted URLs are now stored in the crawl store.
Fixed GenericRecrawlableResolver#MinFrequency() constructor with arguments not setting patterns correctly.
Fixed documents wrongfully being considered orphans when referrer was skipped for being unmodified or premature, or was in temporary error, before its URLs could be extracted. This could cause valid documents to be deleted/ignored (depending on orphan strategy used). Now those "child" URLs will be queued for processing as if they were extracted from referrer page.
Extracted canonical URLs are now normalized before being compared to the URL of their containing page, which was already normalized.
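
The two trailing-slash rules mentioned above ("removeTrailingSlash" and the renamed "addDirectoryTrailingSlash") can be illustrated with a small standalone sketch. This is not the GenericURLNormalizer code: the class and method names are hypothetical, and it assumes that a "directory" is a last path segment without a dot and that the slash handling applies right before any query string or fragment.

```java
// Hypothetical illustration only; it does not use or reproduce the Norconex API.
final class TrailingSlashSketch {

    // Index where the path ends, i.e. where a query (?) or fragment (#) starts.
    private static int pathEnd(String url) {
        int end = url.length();
        int q = url.indexOf('?');
        int h = url.indexOf('#');
        if (q >= 0) { end = Math.min(end, q); }
        if (h >= 0) { end = Math.min(end, h); }
        return end;
    }

    // Drops a single trailing slash found right before the query/fragment.
    static String removeTrailingSlash(String url) {
        int end = pathEnd(url);
        String head = url.substring(0, end);
        if (head.endsWith("/")) {
            head = head.substring(0, head.length() - 1);
        }
        return head + url.substring(end);
    }

    // Appends a slash when the last path segment has no dot
    // (assumption: such a segment is treated as a directory).
    static String addDirectoryTrailingSlash(String url) {
        int end = pathEnd(url);
        String head = url.substring(0, end);
        int schemeEnd = head.indexOf("://");
        int pathStart = head.indexOf('/', schemeEnd < 0 ? 0 : schemeEnd + 3);
        if (pathStart < 0) {
            return url; // no path at all: left untouched in this sketch
        }
        String lastSegment = head.substring(head.lastIndexOf('/') + 1);
        if (!lastSegment.isEmpty() && !lastSegment.contains(".")) {
            head = head + "/";
        }
        return head + url.substring(end);
    }

    public static void main(String[] args) {
        System.out.println(removeTrailingSlash("http://example.com/docs/?a=1"));
        System.out.println(addDirectoryTrailingSlash("http://example.com/docs?a=1"));
        System.out.println(addDirectoryTrailingSlash("http://example.com/page.html"));
    }
}
```

The two rules pull in opposite directions, so in practice a configuration would likely enable only one of them.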

New in Norconex HTTP Collector 2.2.0 (Jul 23, 2015)

Added support for canonical links defined either in HTTP headers or as a link tag in the head of an HTML document. Canonical link detection is always performed unless explicitly disabled. #79
New URLStatusCrawlerEventListener class for producing reports of fetched URLs and their status. Useful for finding broken links, among other things.
Added three new configuration options to GenericHttpClientFactory to better deal with HTTP connectivity issues (like timeouts): "maxConnectionsPerRoute", "maxConnectionIdleTime", and "maxConnectionInactiveTime". #118
New LastModifiedMetadataChecksummer that uses the Last-Modified HTTP header value for checksum purposes, replacing HttpMetadataChecksummer as the default implementation. To create a checksum from one or more fields of your choice, you can now use the new GenericMetadataChecksummer from the Collector Core dependency.
New CurrentDateTagger, DateMetadataFilter, NumericMetadataFilter, TextPatternTagger, GenericSpoiledReferenceStrategizer and more new features introduced by dependency upgrades.
New method GenericDocumentFetcher#setNotFoundStatusCodes(int...) to specify one or several custom "Not Found" HTTP codes. Default is 404.
GenericHttpClientFactory default maximum connections was increased from 20 to 200 and default maximum connections per route was increased from 2 to 20. #118
New HttpFetchResponse class now passed to crawl event listeners after a document fetch instead of the IHttpDocumentFetcher used. This adds the ability to listen for specific HTTP response status codes. As a consequence, IHttpDocumentFetcher now returns an HttpFetchResponse.
HttpMetadataChecksummer has been deprecated in favor of LastModifiedMetadataChecksummer.
HtmlLinkExtractor now supports specifying tags without an attribute for detecting URLs.
HtmlLinkExtractor now ignores whatever is found between "script" tags so that JavaScript-generated URLs can no longer cause trouble. #119
Maven dependency updates: Norconex Collector Core 1.2.0, Joda Time 2.8.1, Apache HTTP Client 4.5, Jetty Webapp 9.2.11.v20150529, Apache Ant 1.9.5.
Jar manifest now includes implementation entries and specifications entries (matching Maven pom.xml).
Improved/fixed javadoc.
HttpCrawlState#NOT_FOUND was migrated to Norconex Collector Core CrawlState#NOT_FOUND.
Fixed HTML documents being skipped when HtmlLinkExtractor found a URL of invalid format. A warning is now issued for each bad URL instead, the document is processed anyway, and the good URLs are extracted. #119
Fixed MongoDB stage/depth compound index. #97
Fixed MongoCrawlDataSerializer "referrerLinkText" link attribute/metadata having the same value as "referrerLinkTag". #82
HtmlLinkExtractor now decodes encoded ampersands present in URLs. #88
Both HtmlLinkExtractor and TikaLinkExtractor no longer extract empty href links. Added this use case to corresponding unit test. #87
HtmlLinkExtractor now strips leading spaces.
Fixed "trustAllSSLCertificates" configuration option on GenericHttpClientFactory not being recognized in XML config. #100
Fixed exceptions thrown in StandardRobotsTxtProvider when robots.txt contained rules ending with ? or when the referrer URL started with a space.
Several invalid characters are now supported in URLs (e.g., commas).
Fixed GenericDelayResolver not saving in XML properly and fixed its javadoc as well, which did not mention how to pass the delay in XML config.
TargetURLRedirectStrategy no longer throws an exception when redirects are disabled for a request. #124
Fixed "sitemapLocations" and "lenient" not being applied to StandardSitemapFactory.
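
Several of the link-extraction changes listed above (ignoring content between "script" tags, decoding encoded ampersands, skipping empty href values, and stripping leading spaces) can be pictured with the following rough sketch. It is a simplified regex-based illustration, not the HtmlLinkExtractor implementation: the class name, the patterns, and the restriction to double-quoted href attributes are all assumptions made for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; it does not use or reproduce the Norconex API.
final class LinkCleanupSketch {

    // Matches a whole script element so its content can be dropped.
    private static final Pattern SCRIPT =
            Pattern.compile("(?is)<script\\b.*?</script\\s*>");

    // Matches double-quoted href attribute values only (simplification).
    private static final Pattern HREF =
            Pattern.compile("(?i)\\bhref\\s*=\\s*\"([^\"]*)\"");

    static List<String> extractLinks(String html) {
        // 1. Ignore whatever is found between script tags.
        String withoutScripts = SCRIPT.matcher(html).replaceAll(" ");

        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(withoutScripts);
        while (m.find()) {
            String url = m.group(1)
                    .replaceFirst("^\\s+", "")   // 2. strip leading spaces
                    .replace("&amp;", "&");      // 3. decode encoded ampersands
            if (!url.isEmpty()) {                // 4. skip empty href links
                links.add(url);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\" /a?x=1&amp;y=2\">A</a>"
                + "<a href=\"\">empty</a>"
                + "<script>document.write('<a href=\"/js\">');</script>"
                + "<a href=\"/b\">B</a>";
        System.out.println(extractLinks(html));
    }
}
```

A real extractor would rely on proper HTML parsing rather than regular expressions; the sketch only shows the order in which the clean-up steps could be applied.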

New in Norconex HTTP Collector 2.1.0 (Apr 9, 2015)

New in Norconex HTTP Collector 2.1.0 Dev (Mar 3, 2015)

New in Norconex HTTP Collector 2.0.2 (Feb 5, 2015)

New in Norconex HTTP Collector 2.0.1 (Dec 4, 2014)

New in Norconex HTTP Collector 2.0.0 (Nov 28, 2014)

Upgraded Norconex Importer to version 2.0.0, which brings to Norconex HTTP Collector a lot of new features, such as: Document content splitting, splitting of embedded documents into individual documents, new taggers for language detection, changing character case, parsing and formatting dates, providing content statistics, and more. Please read the Norconex Importer release notes for a complete list of changes at: http://www.norconex.com/product/importer/changes-report.html#a2.0.0
Can now supply a "pathsFile" as part of the startPaths, acting as a seed list.
New fast MVStore database implementation for URL database (from Norconex Collector Core).
New H2 database implementation for URL database (crawl data store).
Now keeps track of parent references (for embedded/split documents).
More unit tests, with the addition of an embedded Jetty Web server rendering test pages for some unit tests.
New JMX/MBean support added on crawlers.
IUrlExtractor is now ILinkExtractor, and both of its implementing classes (HtmlLinkExtractor and TikaLinkExtractor) now also support extracting a link title and text (github #23), as well as the "nofollow" robot rule.
It is now possible to configure multiple link extraction classes, each taking effect on particular URLs and/or content-types.
IHtmlLinkExtractor can be configured to use specified HTML tags and attributes to find URLs.
Now licensed under The Apache License, Version 2.0.
Replaced the configuration option "deleteOrphans(true|false)" with "orphansStrategy(DELETE|PROCESS|IGNORE)".
The collector now references document content as reusable InputStream with memory caching instead of relying only on files. This saves a great deal of disk I/O and improves performance in most cases.
Refactored to use the new Norconex Collector Core library. A significant portion of the Norconex HTTP Collector code has been moved to that core library. Some of the moved classes are (base package being com.norconex.collector.*):
  http.checksum.IHttpDocumentChecksummer to core.checksum.IDocumentChecksummer
  http.checksum.IHttpHeadersChecksummer to core.checksum.IMetadataChecksummer
  http.checksum.DefaultHttpDocumentChecksummer to core.checksum.impl.MD5DocumentChecksummer
  http.filter.IURLFilter to core.filter.IReferenceFilter
  http.filter.IHttpHeadersFilter to core.filter.IMetadataFilter
  http.filter.IHttpDocumentFilter to core.filter.IDocumentFilter
  http.filter.impl.ExtensionURLFilter to core.filter.impl.ExtensionURLFilter
  http.filter.impl.RegexHeaderFilter to core.filter.impl.RegexMetadataFilter
  http.filter.impl.RegexURLFilter to core.filter.impl.RegexReferenceFilter
Amongst others, the following classes were renamed (within com.norconex.collector.http.*):
  checksum.impl.DefaultHttpHeadersChecksummer to checksum.impl.HttpMetadataChecksummer
  client.impl.DefaultHttpClientFactory to client.impl.GenericHttpClientFactory
  delay.impl.DefaultDelayResolver to delay.impl.GenericDelayResolver
  fetch.impl.DefaultDocumentFetcher to fetch.impl.GenericDocumentFetcher
  fetch.impl.SimpleHttpHeadersFetcher to fetch.impl.GenericHttpHeadersFetcher
  robot.impl.DefaultRobotsMetaProvider to robot.impl.StandardRobotsMetaProvider
  robot.impl.DefaultRobotsTxtProvider to robot.impl.StandardRobotsTxtProvider
  sitemap.impl.DefaultSitemapResolver to sitemap.impl.StandardSitemapResolver
  url.impl.DefaultURLExtractor to url.impl.GenericURLExtractor
Several references to "url" were changed to "reference".
New and more scalable crawler event model along with new listeners.
Refactored to use JEF 4.0.0 which makes the HTTP Collector easier to monitor.
Other library upgrades: Norconex Committer to 2.0.0 and Norconex Commons Lang to 1.5.0.
Removed previously deprecated classes.
Crawled sitemap details now have their own store (no longer mixed with the crawl data store).
ISiteMapResolver now needs an ISiteMapResolverFactory.
Sitemap resolution now stops when a stop request is issued (github #38).
Now checks if crawler is running before attempting to stop it (github #37).
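
To make the move from a boolean "deleteOrphans" to the three-valued "orphansStrategy(DELETE|PROCESS|IGNORE)" more concrete, here is a minimal sketch of how a three-way strategy could be dispatched. The class, its fields, and the meaning given to each value (DELETE requests deletion, PROCESS re-queues the orphans, IGNORE does nothing) are assumptions for illustration, not the collector's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; it does not use or reproduce the Norconex API.
final class OrphanStrategySketch {

    // The three values named in the changelog entry.
    enum OrphansStrategy { DELETE, PROCESS, IGNORE }

    // Stand-ins for "send a deletion request" and "queue for crawling".
    final List<String> deleted = new ArrayList<>();
    final List<String> queued = new ArrayList<>();

    void handleOrphans(List<String> orphans, OrphansStrategy strategy) {
        switch (strategy) {
            case DELETE:
                deleted.addAll(orphans);   // assumed: orphans get removed
                break;
            case PROCESS:
                queued.addAll(orphans);    // assumed: orphans get crawled again
                break;
            case IGNORE:
            default:
                break;                     // assumed: orphans are left alone
        }
    }

    public static void main(String[] args) {
        OrphanStrategySketch sketch = new OrphanStrategySketch();
        List<String> orphans = Arrays.asList(
                "http://example.com/old-a", "http://example.com/old-b");
        sketch.handleOrphans(orphans, OrphansStrategy.PROCESS);
        System.out.println("queued=" + sketch.queued + " deleted=" + sketch.deleted);
    }
}
```

Compared with a true/false flag, an enumeration leaves room for the third behavior (reprocessing) without overloading the meaning of "false".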

New in Norconex HTTP Collector 1.3.4 (Aug 26, 2014)

New in Norconex HTTP Collector 1.3.3 (Aug 8, 2014)

New in Norconex HTTP Collector 1.3.2 (Jun 25, 2014)

New in Norconex HTTP Collector 1.3.1 (Apr 15, 2014)

New in Norconex HTTP Collector 1.3.0 (Mar 25, 2014)

Now supports NTLM authentication. SPNEGO and Kerberos were also added but are experimental (see DefaultHttpClientFactory).
Can now specify character set of HTTP connections and authentication forms.
Can now set custom timeout values on HTTP connection-related activities.
New option to trust all SSL certificates of sites being crawled (see DefaultHttpClientFactory).
Can now specify a maximum number of HTTP connections for each crawler independently of configured number of threads (see DefaultHttpClientFactory).
DefaultHttpClientFactory introduces additional configuration options: proxy scheme, 'Expect: 100-continue' handshake, maximum HTTP redirects, local address, stale connection checks.
HTTP header checksum and document checksum are now added to the document metadata as HttpMetadata#CHECKSUM_HEADER and HttpMetadata#CHECKSUM_DOC.
The empty sub-folders contained under the "download" folder are now periodically deleted. This speeds up directory scanning and increases performance on large crawls.
The userAgent is now a crawler configuration option (previously it was an option of DefaultHttpClientInitializer).
API change: IRobotsTxtProvider#getRobotsTxt(...) method signature has changed to accept the User-Agent.
IHttpClientInitializer is now deprecated in favor of IHttpClientFactory, giving you more control over HttpClient creation.
API change: Methods previously accepting DefaultHttpClient instances now have their signature accepting parent interface HttpClient instead.
More logging to help resolve crawler issues with DEBUG log level.
HttpCrawler is now more lenient upon encountering some errors that previously aborted the entire execution.
Library upgrades: updated the default crawl URL database (MapDB) to version 0.9.10, Norconex Commons Lang to 1.3.0, Norconex Committer to 1.2.0, Norconex Importer to 1.2.0, and Apache HttpClient to 4.3.2.
Now ensures that robots.txt agent matching gives priority to the most specific match (as opposed to the first match). Sitemaps detected in robots.txt are also preserved for sitemap resolving.
Removed classes deprecated since 1.1.
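
The difference between "first match" and "most specific match" for robots.txt agents can be sketched as follows. This is a simplified illustration, not the robots.txt provider's code: the class and method names are hypothetical, and it assumes that "most specific" means the longest agent token contained in the crawler's User-Agent, with "*" as the least specific fallback.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; it does not use or reproduce the Norconex API.
final class RobotsAgentMatchSketch {

    // Returns the robots.txt agent entry that best matches the User-Agent,
    // or null when nothing matches (not even "*").
    static String mostSpecificAgent(String userAgent, List<String> robotsAgents) {
        String ua = userAgent.toLowerCase(Locale.ROOT);
        String best = null;
        int bestScore = -1;
        for (String agent : robotsAgents) {
            String token = agent.trim().toLowerCase(Locale.ROOT);
            int score;
            if (token.equals("*")) {
                score = 0;                 // wildcard: matches, but least specific
            } else if (!token.isEmpty() && ua.contains(token)) {
                score = token.length();    // assumed: longer token = more specific
            } else {
                continue;                  // does not apply to this crawler
            }
            if (score > bestScore) {
                bestScore = score;
                best = agent;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> agents = Arrays.asList("mybot", "*", "mybot-crawler");
        System.out.println(mostSpecificAgent("mybot-crawler/1.0", agents));
    }
}
```

With a first-match policy, the example User-Agent would have been governed by the "mybot" rules simply because that entry appears first; scoring every entry avoids depending on the order of the file.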

New in Norconex HTTP Collector 1.2.0 (Jan 11, 2014)

New in Norconex HTTP Collector 1.1.1 (Oct 3, 2013)

New in Norconex HTTP Collector 1.1.0 (Aug 22, 2013)

New in Norconex HTTP Collector 1.0.2 (Jul 12, 2013)

New in Norconex HTTP Collector 1.0.1 (Jul 10, 2013)