What's new in Norconex HTTP Collector 2.6.0

Oct 26, 2016
  • New normalization rule for GenericURLNormalizer: removeTrailingSlash.
  • Can now specify "notFoundStatusCodes" on GenericMetadataFetcher.
  • Specifying an empty "path" tag in XML config, or setting a null or empty string array with the StandardSitemapResolverFactory#setSitemapPaths(...) method, now prevents the collector from trying to locate sitemaps at default locations. It will rely strictly on sitemap URLs specified as start URLs or found in robots.txt (if enabled).
  • GenericLinkExtractor no longer extracts URLs from HTML/XML comments by default. To re-enable this behavior, a new "setCommentsEnabled(boolean)" method has been added.
  • Normalization rule "addTrailingSlash" in GenericURLNormalizer has been renamed "addDirectoryTrailingSlash".
  • SitemapStore now uses MVStore instead of MapDB for storing the list of processed sitemaps.
  • Referrer data is now always stored for GenericLinkExtractor (default) and TikaLinkExtractor.
  • Redirects encountered by IHttpMetadataFetcher implementations are now followed by default.
  • Dependency updates: Norconex Collector Core 1.6.0.
  • Renamed constant HttpMetadata.COLLECTOR_REFERNCED_URLS to HttpMetadata.COLLECTOR_REFERENCED_URLS (was misspelled).
  • API break: method signature changed for IHttpMetadataFetcher from "Properties fetchHTTPHeaders(HttpClient httpClient, String url)" to "HttpFetchResponse fetchHTTPHeaders(HttpClient httpClient, String url, Properties headers)".
  • The subject for the crawler event HttpCrawlerEvent.DOCUMENT_METADATA_FETCHED is now an instance of HttpFetchResponse.
  • Extracted canonical URLs now have their referrer reference stored with them. Extracted URLs are now stored in crawl store. Now using MVStore for sitemap store (instead of MapDB).
  • Fixed GenericRecrawlableResolver#MinFrequency() constructor with arguments not setting patterns correctly.
  • Fixed documents wrongfully being considered orphans when referrer was skipped for being unmodified or premature, or was in temporary error, before its URLs could be extracted. This could cause valid documents to be deleted/ignored (depending on orphan strategy used). Now those "child" URLs will be queued for processing as if they were extracted from referrer page.
  • Extracted canonical URLs are now normalized before being compared to the URL of their containing page, which was already normalized.
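
The two trailing-slash rules named above ("removeTrailingSlash" and the renamed "addDirectoryTrailingSlash") can be pictured with the following minimal sketch. It is not the GenericURLNormalizer implementation: the class and method names are hypothetical, it only handles absolute URLs with an authority, and treating a last path segment without a dot as a directory is an assumption made for illustration.

```java
import java.net.URI;

// Illustrative only: NOT the GenericURLNormalizer implementation.
final class TrailingSlashSketch {

    // Idea behind "removeTrailingSlash": drop one trailing slash from the
    // path. This sketch chooses to leave a bare root path ("/") untouched.
    static String removeTrailingSlash(String url) {
        URI u = URI.create(url);
        String path = u.getRawPath();
        if (path == null || path.length() <= 1 || !path.endsWith("/")) {
            return url;
        }
        return rebuild(u, path.substring(0, path.length() - 1));
    }

    // Idea behind "addDirectoryTrailingSlash": append a slash when the last
    // path segment has no dot (assumed here to denote a directory).
    static String addDirectoryTrailingSlash(String url) {
        URI u = URI.create(url);
        String path = u.getRawPath();
        if (path == null || path.isEmpty() || path.endsWith("/")) {
            return url;
        }
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        if (lastSegment.contains(".")) {
            return url;
        }
        return rebuild(u, path + "/");
    }

    // Reassembles the URL around the new path, keeping query and fragment.
    private static String rebuild(URI u, String newPath) {
        StringBuilder b = new StringBuilder();
        b.append(u.getScheme()).append("://")
         .append(u.getRawAuthority()).append(newPath);
        if (u.getRawQuery() != null) {
            b.append('?').append(u.getRawQuery());
        }
        if (u.getRawFragment() != null) {
            b.append('#').append(u.getRawFragment());
        }
        return b.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeTrailingSlash("http://example.com/docs/"));
        System.out.println(addDirectoryTrailingSlash("http://example.com/docs"));
    }
}
```

The two rules pull in opposite directions, so a configuration would normally enable only one of them.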

New in Norconex HTTP Collector 2.2.0 (Jul 23, 2015)

  • Added support for canonical links defined either in HTTP headers or as a link tag in the head of an HTML document. Canonical link detection is always performed unless explicitly disabled. #79
  • New URLStatusCrawlerEventListener class for producing reports of fetched URLs and their status. Useful for finding broken links, among other things.
  • Added three new configuration options to GenericHttpClientFactory to better deal with HTTP connectivity issues (like timeouts): "maxConnectionsPerRoute", "maxConnectionIdleTime", and "maxConnectionInactiveTime". #118
  • New LastModifiedMetadataChecksummer that uses Last-Modified HTTP header value for checksum purposes, replacing HttpMetadataChecksummer as the default implementation. For choosing one or more fields of your choice to create a checksum, you can now use the new GenericMetadataChecksummer from the Collector Core dependency.
  • New CurrentDateTagger, DateMetadataFilter, NumericMetadataFilter, TextPatternTagger, GenericSpoiledReferenceStrategizer and more new features introduced by dependency upgrades.
  • New method GenericDocumentFetcher#setNotFoundStatusCodes(int...) to specify one or several custom "Not Found" HTTP codes. Default is 404.
  • GenericHttpClientFactory default maximum connection was increased from 20 to 200 and default maximum connections per route was increased from 2 to 20. #118
  • New HttpFetchResponse class now passed to crawl event listeners after a document fetch instead of the IHttpDocumentFetcher used. This adds the ability to listen for specific HTTP response status codes. As a consequence, IHttpDocumentFetcher now returns an HttpFetchResponse.
  • HttpMetadataChecksummer has been deprecated in favor of LastModifiedMetadataChecksummer.
  • HtmlLinkExtractor now supports specifying tags without an attribute for detecting URLs.
  • HtmlLinkExtractor now ignores whatever is found between "script" tags so that JavaScript-generated URLs can no longer cause trouble. #119
  • Maven dependency updates: Norconex Collector Core 1.2.0, Joda Time 2.8.1, Apache HTTP Client 4.5, Jetty Webapp 9.2.11.v20150529, Apache Ant 1.9.5.
  • Jar manifest now includes implementation entries and specifications entries (matching Maven pom.xml).
  • Improved/fixed javadoc.
  • HttpCrawlState#NOT_FOUND was migrated to Norconex Collector Core CrawlState#NOT_FOUND.
  • Fixed HTML documents being skipped when HtmlLinkExtractor found a URL of invalid format. A warning is now issued for each bad URL instead, the document is processed anyway, and good URLs are extracted. #119
  • Fixed MongoDB stage/depth compound index. #97
  • Fixed MongoCrawlDataSerializer "referrerLinkText" link attribute/metadata having the same value as "referrerLinkTag". #82
  • HtmlLinkExtractor now decodes encoded ampersands present in URLs. #88
  • Both HtmlLinkExtractor and TikaLinkExtractor no longer extract empty href links. Added this use case to corresponding unit test. #87
  • HtmlLinkExtractor now strips leading spaces.
  • Fixed "trustAllSSLCertificates" configuration option on GenericHttpClientFactory not being recognized in XML config. #100
  • Fixed exceptions thrown in StandardRobotsTxtProvider when robots.txt contained rules ending with ? or when the referrer URL started with a space.
  • Several invalid characters are now supported in URLs (e.g., commas).
  • Fixed GenericDelayResolver not saving in XML properly and fixed its javadoc as well, which did not mention how to pass the delay in XML config.
  • TargetURLRedirectStrategy no longer throws an exception when redirects are disabled for a request. #124
  • Fixed "sitemapLocations" and "lenient" not being applied to StandardSitemapFactory.
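
Several of the HtmlLinkExtractor changes above (ignoring content between "script" tags, decoding encoded ampersands, stripping leading spaces, skipping empty href values) can be illustrated together. The following is a minimal regex-based sketch under stated assumptions, not the HtmlLinkExtractor code: the class name is hypothetical and it only looks at double-quoted href attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: NOT the HtmlLinkExtractor implementation.
final class HrefCleanupSketch {

    // Matches a whole script element, including its content.
    private static final Pattern SCRIPT =
            Pattern.compile("(?is)<script\\b.*?</script>");

    // Matches double-quoted href attribute values only (a simplification).
    private static final Pattern HREF =
            Pattern.compile("(?i)\\bhref\\s*=\\s*\"([^\"]*)\"");

    static List<String> extract(String html) {
        // Drop script blocks so JavaScript-generated markup is never scanned.
        String withoutScripts = SCRIPT.matcher(html).replaceAll("");
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(withoutScripts);
        while (m.find()) {
            String url = m.group(1)
                    .replace("&amp;", "&")      // decode encoded ampersands
                    .replaceFirst("^\\s+", ""); // strip leading spaces
            if (!url.isEmpty()) {               // skip empty href values
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\" /a?x=1&amp;y=2\">A</a><a href=\"\">B</a>";
        System.out.println(extract(html));
    }
}
```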

New in Norconex HTTP Collector 2.1.0 (Apr 9, 2015)

  • Several new features, updates and fixes were added by upgrading Norconex Collector Core (http://www.norconex.com/collectors/collector-core/) and Norconex Importer (http://www.norconex.com/collectors/importer/) dependencies. Those include support for OCR, translation, a title generator, new content type parsing, and more. Refer to dependency release notes for more details.
  • New methods and configuration attribute to disable checksum creation in HttpMetadataChecksummer.
  • Sitemap resolving pipeline stage is now always invoked (but won't do anything if disabled).
  • Library updates: Norconex Collector Core 1.1.0, JUnit 4.12, Joda-Time 2.7, Apache HTTP Components 4.4, Jetty Webapp 9.2.10.v20150310, Fongo 1.6.2.
  • Added Sonatype repository to pom.xml for snapshot releases.
  • Updated several maven plugins and added SonarQube maven plugin.
  • Improvements on character encoding detection from HTTP headers.
  • log4j.properties from ./classes/ now properly loaded by collector-http.sh (github #59).
  • Improved javadoc.
  • Added many unit tests for testing start vs resume vs stop vs deleted vs modified, in different JVM instances.
  • "minimum" and "complex" configuration examples now ignore sitemap.xml files.
  • Fixed link extractor not fetching link text properly when keepReferrerData is true on HtmlLinkExtractor (github #56).
  • Robot metadata found in HTML pages will no longer be extracted if found within an HTML comment. Robot metadata detection is also more robust (github #60).
  • Fixed NPE in HttpImporterPipelineUtil#enhanceHTTPHeaders when content type from HTTP header is not defined.
  • Fixed log4j log levels incorrectly ending with a semi-colon.

New in Norconex HTTP Collector 2.1.0 Dev (Mar 3, 2015)

  • Fixed link extractor not fetching link text properly when keepReferrerData is true on HtmlLinkExtractor (github #56).
  • Upgraded Norconex Collector Core to 1.1.0.
  • Added Sonatype repository to pom.xml for snapshot releases.
  • Improvements on character encoding detection from HTTP headers.
  • log4j.properties from ./classes/ now properly loaded by collector-http.sh (github #59).

New in Norconex HTTP Collector 2.0.2 (Feb 5, 2015)

  • Fixed the collector "stop" action having no effect (github #49).
  • Fixed crawl data wrongfully applied as metadata after the import phase.
  • Fixed NullPointerException when sitemap support is disabled.
  • Fixed incorrect deletion behavior for embedded orphan documents.
  • Improved log4j.properties logging options for crawler events.
  • Upgraded Norconex Collector Core dependency to 1.0.2.

New in Norconex HTTP Collector 2.0.1 (Dec 4, 2014)

  • From collector-core-1.0.1: When keepDownloads is true, saved files and directories are now prefixed with "f." and "d." respectively to avoid collisions.
  • Fixed errors in example configuration files.
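
The reason for the "f." and "d." prefixes is that a URL such as /docs can be a document while /docs/page.html also exists, so "docs" would have to be both a file and a directory on disk. The sketch below shows the idea only. The class and method names are hypothetical, and the actual on-disk layout produced by Collector Core may differ.

```java
// Illustrative only: the real layout used when keepDownloads is true may differ.
final class DownloadPathSketch {

    // Maps a URL path to a relative local path: every directory segment
    // gets a "d." prefix and the final (file) segment gets an "f." prefix.
    static String toLocalPath(String urlPath) {
        String[] segments = urlPath.replaceAll("^/+", "").split("/");
        StringBuilder b = new StringBuilder();
        for (int i = 0; i < segments.length; i++) {
            boolean last = i == segments.length - 1;
            if (i > 0) {
                b.append('/');
            }
            b.append(last ? "f." : "d.").append(segments[i]);
        }
        return b.toString();
    }

    public static void main(String[] args) {
        // Both documents can now be saved without clashing on "docs".
        System.out.println(toLocalPath("/docs"));
        System.out.println(toLocalPath("/docs/page.html"));
    }
}
```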

New in Norconex HTTP Collector 2.0.0 (Nov 28, 2014)

  • Upgraded Norconex Importer to version 2.0.0, which brings to Norconex HTTP Collector a lot of new features, such as: Document content splitting, splitting of embedded documents into individual documents, new taggers for language detection, changing character case, parsing and formatting dates, providing content statistics, and more. Please read the Norconex Importer release notes for a complete list of changes at: http://www.norconex.com/product/importer/changes-report.html#a2.0.0
  • Can now supply a "pathsFile" as part of the startPaths, acting as a seed list.
  • New fast MVStore database implementation for URL database (from Norconex Collector Core).
  • New H2 database implementation for URL database (crawl data store).
  • Now keeps track of parent references (for embedded/split documents).
  • More unit tests, with the addition of an embedded Jetty Web server rendering test pages for some unit tests.
  • New JMX/MBean support added on crawlers.
  • IUrlExtractor is now ILinkExtractor. Both of its implementing classes (HtmlLinkExtractor and TikaLinkExtractor) now also support extracting a link title and text (github #23), as well as the "nofollow" robot rule.
  • It is now possible to configure multiple link extraction classes, each taking effect on particular URLs and/or content-types.
  • IHtmlLinkExtractor can be configured to use specified HTML tags and attributes to find URLs.
  • Now licensed under The Apache License, Version 2.0.
  • Replaced the configuration option "deleteOrphans(true|false)" with "orphansStrategy(DELETE|PROCESS|IGNORE)".
  • The collector now references document content as reusable InputStream with memory caching instead of relying only on files. This saves a great deal of disk I/O and improves performance in most cases.
  • Refactored to use the new Norconex Collector Core library. A significant portion of the Norconex HTTP Collector code has been moved to that core library. Some of the moved classes are (base package being com.norconex.collector.*): http.checksum.IHttpDocumentChecksummer to core.checksum.IDocumentChecksummer, http.checksum.IHttpHeadersChecksummer to core.checksum.IMetadataChecksummer, http.checksum.DefaultHttpDocumentChecksummer to core.checksum.impl.MD5DocumentChecksummer, http.filter.IURLFilter to core.filter.IReferenceFilter, http.filter.IHttpHeadersFilter to core.filter.IMetadataFilter, http.filter.IHttpDocumentFilter to core.filter.IDocumentFilter, http.filter.impl.ExtensionURLFilter to core.filter.impl.ExtensionURLFilter, http.filter.impl.RegexHeaderFilter to core.filter.impl.RegexMetadataFilter, http.filter.impl.RegexURLFilter to core.filter.impl.RegexReferenceFilter.
  • Amongst others, the following classes were renamed (within com.norconex.collector.http.*): checksum.impl.DefaultHttpHeadersChecksummer to checksum.impl.HttpMetadataChecksummer, client.impl.DefaultHttpClientFactory to client.impl.GenericHttpClientFactory, delay.impl.DefaultDelayResolver to delay.impl.GenericDelayResolver, fetch.impl.DefaultDocumentFetcher to fetch.impl.GenericDocumentFetcher, fetch.impl.SimpleHttpHeadersFetcher to fetch.impl.GenericHttpHeadersFetcher, robot.impl.DefaultRobotsMetaProvider to robot.impl.StandardRobotsMetaProvider, robot.impl.DefaultRobotsTxtProvider to robot.impl.StandardRobotsTxtProvider, sitemap.impl.DefaultSitemapResolver to sitemap.impl.StandardSitemapResolver, url.impl.DefaultURLExtractor to url.impl.GenericURLExtractor
  • Several references to "url" were changed to "reference".
  • New and more scalable crawler event model along with new listeners.
  • Refactored to use JEF 4.0.0 which makes the HTTP Collector easier to monitor.
  • Other library upgrades: Norconex Committer to 2.0.0 and Norconex Commons Lang to 1.5.0.
  • Removed previously deprecated classes.
  • Crawled sitemap details now has its own store (no longer mixed with the crawl data store).
  • ISiteMapResolver now needs an ISiteMapResolverFactory.
  • Sitemap resolution now stops when a stop request is issued (github #38).
  • Now checks if crawler is running before attempting to stop it (github #37).
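
The "orphansStrategy(DELETE|PROCESS|IGNORE)" option above replaces a boolean with three possible outcomes. A minimal sketch of how such a strategy could be applied follows. It is not the collector's code: the class and method names are hypothetical, an orphan is taken here to be a reference known from a previous run that was not encountered in the current one, and the action attached to each value is an assumption based on the option names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: NOT the collector's orphan handling code.
final class OrphanStrategySketch {

    enum OrphansStrategy { DELETE, PROCESS, IGNORE }

    // References seen in the previous run but not in the current one.
    static Set<String> findOrphans(Set<String> previous, Set<String> current) {
        Set<String> orphans = new TreeSet<>(previous);
        orphans.removeAll(current);
        return orphans;
    }

    // Describes what would happen to each orphan under the given strategy.
    static List<String> plan(Set<String> orphans, OrphansStrategy strategy) {
        List<String> actions = new ArrayList<>();
        for (String ref : orphans) {
            switch (strategy) {
                case DELETE:
                    actions.add("delete " + ref);  // assumed: request deletion
                    break;
                case PROCESS:
                    actions.add("queue " + ref);   // assumed: crawl it again
                    break;
                default:
                    break;                         // IGNORE: do nothing
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        Set<String> previous = new TreeSet<>(List.of("/a", "/b", "/c"));
        Set<String> current = new TreeSet<>(List.of("/a"));
        System.out.println(plan(findOrphans(previous, current),
                OrphansStrategy.PROCESS));
    }
}
```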

New in Norconex HTTP Collector 1.3.4 (Aug 26, 2014)

  • MongoCrawlURLDatabase now supports user authentication.
  • Now requires Java 7 or higher.
  • Fixed DefaultRobotsTxtProvider failing to parse some robots.txt patterns (github #36).

New in Norconex HTTP Collector 1.3.3 (Aug 8, 2014)

  • Upgraded JEF to 3.0.1 to fix stop action not working.
  • Fixed NullPointerException in robots.txt resolution under some circumstances.

New in Norconex HTTP Collector 1.3.2 (Jun 25, 2014)

  • DefaultURLExtractor no longer treats an empty href as being a URL ending with a double-quote.
  • Renamed HttpMetadata key "collector.http.dept" to "collector.http.depth" (typo fix).
  • Upgraded Norconex Commons Lang to 1.3.2.
  • GenericURLNormalizer no longer rejects URLs with spaces in them. It now logs a warning instead (thanks to Norconex Commons Lang upgrade).

New in Norconex HTTP Collector 1.3.1 (Apr 15, 2014)

  • Header and document checksum value are no longer added by default to prevent the issue described in github ticket #24. Instead, adding checksum is now an optional feature of DefaultHttpDocumentChecksummer and DefaultHttpHeadersChecksummer.

New in Norconex HTTP Collector 1.3.0 (Mar 25, 2014)

  • Now supports NTLM authentication. SPNEGO and Kerberos were also added but are experimental (see DefaultHttpClientFactory).
  • Can now specify character set of HTTP connections and authentication forms.
  • Can now set custom timeout values on HTTP connection-related activities.
  • New option to trust all SSL certificates of sites being crawled (see DefaultHttpClientFactory).
  • Can now specify a maximum number of HTTP connections for each crawler independently of configured number of threads (see DefaultHttpClientFactory).
  • DefaultHttpClientFactory introduces additional configuration options: proxy scheme, 'Expect: 100-continue' handshake, maximum HTTP redirects, local address, stale connection checks.
  • HTTP header checksum and document checksum are now added to the document metadata as HttpMetadata#CHECKSUM_HEADER and HttpMetadata#CHECKSUM_DOC.
  • The empty sub-folders contained under the "download" folder are now periodically deleted. This speeds up directory scanning and increases performance on large crawls.
  • The userAgent is now a crawler configuration option (previously was an option of DefaultHttpClientInitializer).
  • API change: IRobotsTxtProvider#getRobotsTxt(...) method signature has changed to accept the User-Agent.
  • IHttpClientInitializer is now deprecated in favor of IHttpClientFactory, giving you more control over HttpClient creation.
  • API change: Methods previously accepting DefaultHttpClient instances now have their signature accepting parent interface HttpClient instead.
  • More logging to help resolve crawler issues with DEBUG log level.
  • HttpCrawler is now more lenient upon encountering some errors that were previously aborting the entire execution.
  • Library upgrades. Updated default crawl url database (MapDB) to version 0.9.10, Norconex Commons Lang to 1.3.0, Norconex Committer to 1.2.0, Norconex Importer to 1.2.0, and Apache HttpClient to 4.3.2.
  • Now ensures that robots.txt agent matching gives priority to the most specific match (as opposed to the first match). Sitemaps detected in robots.txt are also preserved for sitemap resolving.
  • Removed classes deprecated since 1.1.
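
The robots.txt change above (most specific agent match instead of first match) can be sketched as follows. This is not the DefaultRobotsTxtProvider code: the class and method names are hypothetical, and "most specific" is interpreted here as the longest User-agent token contained in the crawler's user agent, with "*" used only as a fallback.

```java
import java.util.List;
import java.util.Locale;

// Illustrative only: NOT the robots.txt provider implementation.
final class RobotsAgentMatchSketch {

    // Returns the User-agent token that best matches the crawler's user
    // agent: the longest token found in it (case-insensitive), else "*"
    // when present, else null.
    static String bestMatch(List<String> tokens, String userAgent) {
        String ua = userAgent.toLowerCase(Locale.ROOT);
        String best = null;
        boolean wildcard = false;
        for (String token : tokens) {
            String t = token.trim();
            if (t.equals("*")) {
                wildcard = true;
                continue;
            }
            if (t.isEmpty() || !ua.contains(t.toLowerCase(Locale.ROOT))) {
                continue;
            }
            if (best == null || t.length() > best.length()) {
                best = t;
            }
        }
        if (best != null) {
            return best;
        }
        return wildcard ? "*" : null;
    }

    public static void main(String[] args) {
        // A first-match approach would stop at "*" here.
        System.out.println(
                bestMatch(List.of("*", "my", "mycrawler"), "MyCrawler/1.0"));
    }
}
```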

New in Norconex HTTP Collector 1.2.0 (Jan 11, 2014)

  • New optional Mongo URL Database implementation.
  • New TikaURLExtractor class providing an alternate IURLExtractor implementation based on Apache Tika HTMLParser.
  • New SegmentCountURLFilter class for filtering URLs having a specified number of segments (can check duplicate segments too).
  • New unit tests.
  • MapDB URL Database classes moved to its own "mapdb" package. DefaultCrawlURLDatabaseFactory still exists, but is just a pointer to MapDBCrawlURLDatabaseFactory.
  • Example configurations now point to Norconex test pages to ensure their stability.
  • Upgraded dependent libraries: Norconex Committer 1.1.0, Norconex Commons Lang 1.2.0, MapDB 0.9.8 and other third-party libraries.
  • Improved Javadoc.

New in Norconex HTTP Collector 1.1.1 (Oct 3, 2013)

  • Fixed not being able to extract the "href" attribute when it starts a new line.
  • Fixed HTTP redirects not storing final target URL but the source URL instead.
  • Upgraded dependent libraries to Norconex Importer 1.1.0 and Norconex Commons Lang 1.1.0.

New in Norconex HTTP Collector 1.1.0 (Aug 22, 2013)

  • Crawler now fires additional events. Added "documentRobotsMetaRejected" and "documentImportRejected" methods to IHttpCrawlerEventListener.
  • DefaultCrawlURLDatabase now uses a MapDB-based implementation for faster performance. The Derby implementation has been kept for those with a preference for it.
  • Now supports sitemap.xml and sitemap index (plain or gzip).
  • URLs from sitemaps will have the sitemap information as metadata.
  • BASIC and DIGEST authentication now supported.
  • Now supports in-page robot instructions. Via "ROBOTS" meta tag first, or "X-Robots-Tag" tag if present in HTTP header.
  • "ftp" protocol now supported.
  • It is now possible to specify the scope of each delay between URL downloads with DefaultDelayResolver (per crawler, site, or thread). "crawler" is the default.
  • Javadoc Jar and Source Jar are now also deployed to Maven repository.
  • Deprecation of *.handler.* package. Classes have been moved to more intuitive packages.
  • IDelayResolver implementations are no longer systematically synchronized (i.e., accessible by only one thread at a time). This is a decision left up to each implementation.
  • Reduced the number of calls to the crawl database to improve performance (URL filtering must be successfully passed for a document to get queued for processing).
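
The delay scope option described above (per crawler, site, or thread) amounts to choosing which key the "last download" time is tracked under. The following is a minimal sketch of that idea, not the DefaultDelayResolver code: the class, enum, and method names are hypothetical, and the caller passes in the current time and thread name so the behavior stays deterministic.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: NOT the DefaultDelayResolver implementation.
final class DelayScopeSketch {

    enum Scope { CRAWLER, SITE, THREAD }

    private final Scope scope;
    private final long delayMillis;
    // Time of the most recent (possibly scheduled) download per scope key.
    private final Map<String, Long> lastHit = new HashMap<>();

    DelayScopeSketch(Scope scope, long delayMillis) {
        this.scope = scope;
        this.delayMillis = delayMillis;
    }

    // Returns how long the caller should wait before downloading the URL
    // and records the time at which that download is expected to happen.
    synchronized long millisToWait(String url, String threadName, long nowMillis) {
        String key = keyFor(url, threadName);
        Long last = lastHit.get(key);
        long wait = last == null
                ? 0 : Math.max(0, last + delayMillis - nowMillis);
        lastHit.put(key, nowMillis + wait);
        return wait;
    }

    // The scope decides what shares a delay: everything, a host, or a thread.
    private String keyFor(String url, String threadName) {
        switch (scope) {
            case SITE:
                return URI.create(url).getHost();
            case THREAD:
                return threadName;
            default:
                return "crawler";
        }
    }

    public static void main(String[] args) {
        DelayScopeSketch perSite = new DelayScopeSketch(Scope.SITE, 1000);
        System.out.println(perSite.millisToWait("http://a.com/1", "t1", 0));
        System.out.println(perSite.millisToWait("http://a.com/2", "t2", 200));
    }
}
```

With the "site" scope, two threads hitting different hosts do not slow each other down, whereas the "crawler" scope spaces out every download regardless of host or thread.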

New in Norconex HTTP Collector 1.0.2 (Jul 12, 2013)

  • DefaultURLExtractor now properly handles URLs starting with ? and those prefixed with "URL=" (e.g. meta http-equiv="refresh").
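
The two cases in this fix can be illustrated with plain string handling. The sketch below is not the DefaultURLExtractor code: the class and method names are hypothetical, and it assumes a query-only reference ("?...") should replace the query string of the page it was found on.

```java
// Illustrative only: NOT the DefaultURLExtractor implementation.
final class RefreshUrlSketch {

    // Extracts the target from a meta refresh "content" value such as
    // "5; URL=http://example.com/next", dropping the optional "URL=" prefix.
    static String fromMetaRefresh(String content) {
        String value = content.substring(content.indexOf(';') + 1).trim();
        if (value.regionMatches(true, 0, "URL=", 0, 4)) {
            value = value.substring(4).trim();
        }
        return value;
    }

    // Resolves a reference starting with "?" against the page URL by
    // replacing that URL's query string and fragment. Other references
    // are returned unchanged.
    static String resolveQueryOnly(String pageUrl, String ref) {
        if (!ref.startsWith("?")) {
            return ref;
        }
        int cut = pageUrl.indexOf('?');
        int hash = pageUrl.indexOf('#');
        if (cut < 0 || (hash >= 0 && hash < cut)) {
            cut = hash;
        }
        String base = cut < 0 ? pageUrl : pageUrl.substring(0, cut);
        return base + ref;
    }

    public static void main(String[] args) {
        String ref = fromMetaRefresh("0; url=?page=2");
        System.out.println(
                resolveQueryOnly("http://example.com/list.html?page=1", ref));
    }
}
```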

New in Norconex HTTP Collector 1.0.1 (Jul 10, 2013)

  • Bug fix release.