Heritrix is an open-source, flexible, extensible, and scalable web crawler capable of fetching, archiving, and analyzing the full diversity and breadth of internet-accessible content.
Heritrix (sometimes spelled heretrix, or misspelled or mispronounced as heratrix/heritix/heretix/heratix) is an archaic word for heiress (a woman who inherits).
What's New in This Release:
Bugs fixed:
· List of classes is not present in select menu for DecideRules
· WARC metadata records should declare MIME-type 'application/warc-fields' (rather than 'text/anvl')
· Bottleneck in StatisticsTracker.saveSourceStats?
· META http-equiv refresh content containing only a number misinterpreted as a URI
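The META refresh item above concerns content values such as "5", which carry only a delay and no target. A minimal sketch of the distinction follows, assuming the common "delay; url=target" form. The class and method names are hypothetical and are not taken from Heritrix's own extractor code.

```java
import java.util.Optional;

// Hypothetical helper, not part of Heritrix: illustrates how a
// META http-equiv="refresh" content value can be split into delay and target.
class MetaRefresh {

    // Returns the refresh target, or empty when the content holds only a delay.
    static Optional<String> extractUrl(String content) {
        if (content == null) {
            return Optional.empty();
        }
        int sep = content.indexOf(';');
        if (sep < 0) {
            // Only a delay such as "5": the page reloads itself, so there is
            // no new URI to report.
            return Optional.empty();
        }
        String target = content.substring(sep + 1).trim();
        // Strip an optional, case-insensitive "url=" prefix.
        if (target.regionMatches(true, 0, "url", 0, 3)) {
            String rest = target.substring(3).trim();
            if (rest.startsWith("=")) {
                target = rest.substring(1).trim();
            }
        }
        return target.isEmpty() ? Optional.empty() : Optional.of(target);
    }
}
```

With this sketch, extractUrl("5") yields an empty result, while extractUrl("5; url=http://example.com/next") yields the target URI.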
Improvements:
· ${HOSTNAME} in arc suffix is only replaced completely
· Update to BDB-JE 3.3.74
· Update 'public suffix list' (effective_tld_names.dat)