November 5th, 2009
· Support for converting between character encodings through libiconv
· New parser utf8conv for converting almost any character encoding to utf8
· New parser entityconv, converts HTML entities such as &auml; to the corresponding UTF-8 character
· The configuration system has been moved to a separate library, libmethaconfig
· Various improvements to the configuration loader, such as dynamically adding and changing classes and scopes
· Lots of memory usage optimizations and cleanup fixes
· The documentation available in the wiki has been copied to a texinfo file; from now on all documentation will be kept in this texinfo file and made available as a manual both online and offline
· Support for filetype attributes. Parsers can now set custom data that will be associated with a parsed file. Attributes' primary area of use is when you are connected to a Methanol system and want to store meta-data about a URL.
· New Javascript function set_attribute() for setting attributes for the current URL
· API support for custom status, error/warning and target reporter functions
· lmetha_global_setopt() is no longer available, replaced with lmetha_setopt() options
· SpiderMonkey-1.8.0 support added
· New global Javascript function exec()
· New built-in handler function writefile
· libmetha no longer depends on libev, but instead uses pipes and epoll() for inter-thread communication and waiting for events on sockets.
· Added internal counters useful for keeping statistics
· New filetype option 'ignore_host'
· --external option set to false can no longer be circumvented using an HTTP redirect
· Support for CURIE (why not?) in the built-in HTML parser added
· Bugfix, a syntax error would in some rare cases occur when parsing integer values in configuration files
· Bugfix in the configuration file parser when reading flag values
· Bugfix, when javascript filetype parsers did not return a value, it was treated as a string, "undefined", and used as a relative URL
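The utf8conv and entityconv parsers listed for this release can be pictured as a two-step pipeline: first convert the raw bytes to UTF-8, then replace HTML entities with the characters they name. The following is only a rough sketch of that idea in Python; it does not use libiconv or any libmetha code, and the function names to_utf8 and convert_entities are hypothetical names chosen for this illustration.

```python
import html


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("utf-8")


def convert_entities(utf8_bytes: bytes) -> bytes:
    """Replace HTML entities such as &auml; with the matching UTF-8 character."""
    text = utf8_bytes.decode("utf-8")
    return html.unescape(text).encode("utf-8")


if __name__ == "__main__":
    # A Latin-1 encoded snippet that also contains HTML entities.
    latin1_page = "Bj\u00f6rk &amp; M&auml;rta".encode("iso-8859-1")
    as_utf8 = to_utf8(latin1_page, "iso-8859-1")
    print(convert_entities(as_utf8).decode("utf-8"))
```

Running the encoding conversion first means the entity step only ever has to deal with UTF-8 input, which is presumably why the two parsers are separate and chainable.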
February 24th, 2009
· Bugfix, when external-peek was used the depth limit was messed up.
· Memory usage cleanup fixes
· dynamic-url option is no longer set to lookup by default, since it slows down the crawling significantly
· Build system now creates and installs some header files that modules can use when linking
· metha-config tool added
· lmm_mysql moved outside of this package
January 16th, 2009
· Support for reading initial buffer from stdin
· --type and --base-url command line options added, along with the initial_filetype option in configuration files
· Cookies and DNS info are now properly shared between workers when running multithreaded
· Added some example usage commands to --examples
· Big improvements to the inter-thread communication, now faster and more organized
· Added support for 'init' functions to scripts. Read more about init functions at http://bithack.se/projects/methabot/docs/e4x/init_functions.html
· libmetha doesn't freeze when doing multiple concurrent HTTP HEAD requests anymore. The reason for the freezes was a bug in libcurl which is now fixed. Some workarounds have been added to libmetha to prevent the freezes from occurring when using the defective libcurl versions as well.
· Support for older libcurl versions 7.17.x and 7.16.x
· New information is available in the "this" object of javascript parsers, content-type and transfer status code. Read more at http://bithack.se/projects/methabot/docs/e4x/this.html
· --verbose option replaced with --silent, since verbose mode is now default
· Initial support for FTP crawling and the ftp_dir_url crawler option
· Depth limiting is now crawler-specific
· Added the command line options --crawler and --filetype
· Support for extending and overriding already defined crawlers and filetypes
· Support for the copy keyword in configuration files
· Support for dynamically switching the active crawler; this lets you crawl different websites in completely different ways in one crawling session. Read more about crawler switching at http://bithack.se/projects/methabot/docs/crawler_switching.html
· libev version upgrade to 3.51
· The include directive in configuration files now makes sure the included configuration file hasn't already been loaded, to prevent include-loops and multiple filetype/crawler definitions.
· Various SpiderMonkey garbage collection fixes; libmetha no longer crashes when cleaning up after a multithreaded session
· Added some extra information to the --info option
· The 'external' option is now fixed and enabled again
· New option --spread-workers
· New libmetha API function lmetha_global_setopt() allows changing the global error/message/warning reporter
· Added initial implementation of a test suite for developers
· Better error reporting when loading configuration files
· Bugfix when an HTTP server didn't return a Content-Type header after a HEAD request
· Bugfix when sorting URLs after multiple HTTP HEAD requests
· Bugfix in the html to xml converter when the HTML page did not have an tag
· Bugfix, the extless-url option did not work
· Bugfix, html to xml converter no longer chokes on byte-order marks or other text before the actual HTML
· Bugfix, prevented libmetha from trying to access URLs of protocols that are not supported
· Bugfix when shutting down after an error.
· Bugfix, unresolvable URLs did not break out of the retry loop after three retries
· Very experimental and unstable support for Win32, mainly intended for developers
New configuration files:
· google.conf, to perform google searches
· youtube.conf, youtube searching
· meta.conf, prints meta information such as keywords and description about HTML pages
· title.conf, prints the title of HTML pages
· ftp.conf, for crawling FTP servers
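The include-loop protection mentioned in this release (an include directive that skips files which have already been loaded) can be sketched as follows. This is only an illustration of the general technique in Python, not libmetha's loader: the directive syntax include "name.conf", the function load_config and the in-memory files dictionary are all assumptions made for the example.

```python
import re

# Hypothetical directive syntax for this sketch: include "name.conf"
INCLUDE_RE = re.compile(r'^\s*include\s+"([^"]+)"\s*$')


def load_config(name, files, loaded=None):
    """Return the lines of `name` with includes expanded, loading each file once."""
    if loaded is None:
        loaded = set()
    if name in loaded:
        # Already loaded: skip it, which breaks include loops and
        # avoids defining the same filetype or crawler twice.
        return []
    loaded.add(name)

    result = []
    for line in files[name].splitlines():
        match = INCLUDE_RE.match(line)
        if match:
            # Expand the included file in place, sharing the `loaded` set.
            result.extend(load_config(match.group(1), files, loaded))
        else:
            result.append(line)
    return result


if __name__ == "__main__":
    # Two files that include each other; without the guard this would recurse forever.
    files = {
        "a.conf": 'include "b.conf"\noption_a',
        "b.conf": 'include "a.conf"\noption_b',
    }
    print(load_config("a.conf", files))
```

Sharing one set of loaded names across the whole recursion is what makes both cases work: a loop is cut at the second visit, and a file included from two different places contributes its definitions only once.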
December 28th, 2008
· Completely new architectural design
· Filetype parser scripting through Javascript/E4X
· Multithreading is now a primary concept
· HTTP HEAD requests are now done asynchronously in a separate thread using curl and libev
· Support for "peeking" at external URLs
· The Methabot Project has been split up into several subprojects; the primary one is the command line tool, which uses the web crawling library libmetha as its backend.
· Initial work on the distributed web crawling system Methanol.