hamake is a free and open source utility that lets you automate incremental processing of datasets stored on HDFS using Hadoop tasks written in Java or using PigLatin scripts.
Datasets can be either individual files or directories containing groups of files. New files may be added (or removed) at arbitrary locations, which may trigger recalculation of the data that depends on them. In this respect hamake is similar to the Unix 'make' utility.
First, you formulate your processing model in terms of data locations (which can serve either as inputs or outputs) and tasks.
Currently, two types of tasks are supported. Although they are called "map" and "reduce", they should not be confused with Hadoop's "map" and "reduce" phases:
MAP - a task that maps a group of files at one location to another location (or locations). It assumes a 1-to-1 file mapping between locations and can process files incrementally, converting only those that are present at the source location but missing from at least one of the destinations.
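The incremental selection rule behind a MAP task can be illustrated with a small sketch. This is a conceptual illustration only, not hamake's actual API: the function name `files_to_process` and the use of plain sets of file names are assumptions for the example.

```python
def files_to_process(source, destinations):
    """Return source file names absent from at least one destination.

    A MAP-style incremental step only needs to convert these files;
    files already present at every destination are skipped.
    """
    return sorted(
        name
        for name in source
        if any(name not in dest for dest in destinations)
    )

src = {"a.log", "b.log", "c.log"}
dest1 = {"a.log", "b.log"}
dest2 = {"a.log"}

print(files_to_process(src, [dest1, dest2]))  # → ['b.log', 'c.log']
```

Here "a.log" is skipped because it already exists at both destinations, while the other two files are selected for processing.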
To run hamake you will need:
· Python 2.5 or later
· Hadoop 0.18.3 or later
· HadoopThriftServer (available on the downloads page)