Heritrix is an open-source, flexible, robust, extensible, and scalable web crawler capable of fetching, archiving, and analyzing the full diversity and breadth of internet-accessible content.
Heritrix (sometimes spelled heretrix, or misspelled or mispronounced as heratrix/heritix/heretix/heratix) is an archaic word for heiress (a woman who inherits).
What's New in This Release:
· List of classes is not present in select menu for DecideRules
· WARC metadata records should declare MIME-type 'application/warc-fields' (rather than 'text/anvl')
· Possible bottleneck in StatisticsTracker.saveSourceStats
· META http-equiv refresh content containing only a number misinterpreted as a URI