Heritrix Changelog

New in Heritrix 3.2.0 (Mar 27, 2015)

  • Minor version number bumped because some changes are not backward-compatible. For one, Heritrix now requires Java 1.6. It has also been a while since the last release, and there are some significant new features and lots of other changes.
  • URL-agnostic deduplication (HER-2022) - Deduplication for identical content, even at different URLs. Configuration is documented under "Duplication Reduction Processors".
  • Automatic form login (HER-2031) - With the new processors ExtractorHTMLForms and FormLoginProcessor, Heritrix can detect typical HTML login forms and submit a configured username and password to them. Configuration is documented at http://builds.archive.org/javadoc/heritrix-3.x-snapshot/org/archive/modules/forms/ExtractorHTMLForms.html
  • Option to forget all but latest checkpoint (HER-2051) - Enable periodic checkpointing without worrying that it will overflow the disk. When enabled, earlier checkpointed logs are rolled up into a new checkpointed log, and earlier bdb checkpoint dirs and earlier checkpoint dirs are deleted. Related: HER-2056, an option not to roll over WARCs on checkpoint.
  • Handle stats consistently on checkpoint (HER-2048) - On checkpoint resumption, all stats resume counting from where they left off when checkpointed. In addition, DNS lookups are not repeated in a checkpoint-resumed crawl (unless they have expired, same as in a normal crawl).
  • Custom extractor that constructs outlinks from strings found in content (HER-2024) - ExtractorMultipleRegex is a very flexible extractor module that can look for arbitrary regular expressions in the URL and content of a page and construct URLs from the matching groups. Javadoc: http://builds.archive.org/javadoc/heritrix-3.x-snapshot/org/archive/modules/extractor/ExtractorMultipleRegex.html
  • Responsive web UI using FreeMarker templates (HER-1726) - The web UI now uses FreeMarker templates in place of HTML hard-coded in Java source, which also lays the foundation for responsiveness.
  • Improved speculative link extraction from JavaScript (HER-1523) - Significant tightening of speculative link extraction heuristics based on statistical analysis.
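
The idea behind URL-agnostic deduplication (HER-2022) can be illustrated with a small standalone sketch. This is not Heritrix code: the class ContentDigestIndex and its method names are hypothetical, the use of a SHA-1 payload digest as the lookup key is an assumption, and the example needs Java 8 or later (newer than the Java 1.6 that Heritrix 3.2.0 itself requires). It only shows that a lookup keyed on the content digest, rather than on the URL, finds identical content fetched from a different URL.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical illustration of digest-keyed deduplication; not part of Heritrix. */
final class ContentDigestIndex {
    // Maps the hex SHA-1 of a payload to the first URL it was seen at.
    private final Map<String, String> firstUrlByDigest = new HashMap<>();

    /**
     * Records the payload under its digest. Returns the URL at which an
     * identical payload was recorded earlier, or empty on first sighting.
     */
    Optional<String> recordAndFindDuplicate(String url, byte[] payload) {
        String digest = sha1Hex(payload);
        // putIfAbsent returns the previous mapping, or null if there was none.
        String earlierUrl = firstUrlByDigest.putIfAbsent(digest, url);
        return Optional.ofNullable(earlierUrl);
    }

    /** Hex-encoded SHA-1 of the given bytes. */
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-1.
            throw new IllegalStateException(e);
        }
    }
}
```

Recording the same bytes first under http://a.example/x and then under http://b.example/y yields an empty result for the first call and http://a.example/x for the second, because the key ignores the URL entirely.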
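
The approach described for HER-2024, matching regular expressions against a page's URL and content and building new URLs from the captured groups, can be sketched in plain Java. This is a simplified illustration under stated assumptions, not the ExtractorMultipleRegex implementation or its configuration format: the class RegexOutlinkSketch, the extract method, and the String.format-style template are all hypothetical, and only a single content regex is handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of regex-driven outlink construction; not Heritrix code. */
final class RegexOutlinkSketch {
    /**
     * Matches uriRegex against the whole page URL, then emits one link per
     * match of contentRegex in the content. Each link is built by filling
     * the template with group 1 of the URL match and group 1 of the content
     * match, in that order. Returns an empty list if the URL does not match.
     */
    static List<String> extract(String pageUrl, String content,
                                Pattern uriRegex, Pattern contentRegex,
                                String template) {
        List<String> outlinks = new ArrayList<>();
        Matcher uriMatch = uriRegex.matcher(pageUrl);
        if (!uriMatch.matches()) {
            return outlinks;
        }
        Matcher contentMatch = contentRegex.matcher(content);
        while (contentMatch.find()) {
            outlinks.add(String.format(template,
                    uriMatch.group(1), contentMatch.group(1)));
        }
        return outlinks;
    }
}
```

For example, with the URL regex `^(https?://[^/]+)/.*$`, the content regex `data-id="(\d+)"`, and the template `%s/item/%s`, a page at http://example.com/gallery?page=2 containing data-id="17" and data-id="42" produces http://example.com/item/17 and http://example.com/item/42.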

New in Heritrix 3.0.0 (Apr 12, 2010)

  • List of classes is not present in select menu for DecideRules
  • WARC metadata records should declare MIME-type 'application/warc-fields' (rather than 'text/anvl')
  • bottleneck in StatisticsTracker.saveSourceStats?
  • META http-equiv refresh content containing only a number misinterpreted as a URI
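
The META refresh item above concerns content values such as "5", which carry a delay but no target. The distinction can be sketched as follows. This is an illustrative parser only, not the Heritrix extractor or its actual fix: the class MetaRefreshSketch and the method refreshTarget are hypothetical, the sketch requires an explicit "url=" part (browsers are more lenient), and it needs Java 8 or later.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of META refresh content parsing; not Heritrix code. */
final class MetaRefreshSketch {
    // A numeric delay, optionally followed by ';' or ',' and "url=<target>".
    // The target may be wrapped in single or double quotes.
    private static final Pattern REFRESH = Pattern.compile(
            "^\\s*\\d+(?:\\.\\d*)?\\s*"
            + "(?:[;,]\\s*url\\s*=\\s*['\"]?([^'\"\\s]+)['\"]?\\s*)?$",
            Pattern.CASE_INSENSITIVE);

    /**
     * Returns the refresh target if the content attribute names one.
     * A bare delay such as "5" means "reload this page" and yields empty,
     * so the number is never treated as a URI.
     */
    static Optional<String> refreshTarget(String content) {
        Matcher m = REFRESH.matcher(content);
        if (m.matches() && m.group(1) != null) {
            return Optional.of(m.group(1));
        }
        return Optional.empty();
    }
}
```

With this sketch, "5; url=http://example.com/next" yields http://example.com/next, while "5" yields no target at all.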

New in Heritrix 2.0.2 (Mar 5, 2009)

  Bug:
  • List of classes is not present in select menu for DecideRules
  • WARC metadata records should declare MIME-type 'application/warc-fields' (rather than 'text/anvl')
  • bottleneck in StatisticsTracker.saveSourceStats?
  • META http-equiv refresh content containing only a number misinterpreted as a URI

  Improvement:
  • ${HOSTNAME} in arc suffix is only replaced completely
  • update to BDB-JE 3.3.74
  • Update 'public suffix list' (effective_tld_names.dat)