Duke 1.2

A fast deduplication engine
Duke is a small, free, easy-to-use, fast, and flexible deduplication engine (deduplication is also known as entity resolution or record linkage), written in Java on top of Lucene.

At the moment Duke can process 1,000,000 records in 11 minutes (roughly 1,500 records per second) on a standard laptop in a single thread.

Duke can be used to find duplicate records inside a single table/data source, or it can be used to find records in different tables/sources which most likely represent the same real-world entity.

Because it runs on the Java platform, Duke can be used on Mac OS X, Windows, and Linux.

Main features:

  • High performance.
  • Highly configurable.
  • Support for CSV, JDBC, SPARQL, and NTriples DataSources.
  • Many built-in comparators.
  • Plug in your own data sources, comparators, and cleaners.
  • Command-line client for getting started.
  • API for embedding into any kind of application.
  • Support for batch processing and continuous processing.
  • Can maintain a database of the links it finds, via JNDI/JDBC.
  • Can run in multiple threads.

Last updated: February 18th, 2014, 20:57 GMT
File size: 5.1 MB
Price: Free
Developed by: Lars Marius Garshol
License type: Apache
Operating system(s): Mac OS X
Binary format: -
Category: Home \ Developer Tools

Screenshot: Duke reports its findings to a MatchListener. You can write your own MatchListeners or use those that come with Duke.
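
The listener idea described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: `SketchListener`, `CollectingListener`, and `DedupSketch` are hypothetical names invented for this example, records are reduced to plain strings, and the similarity function is a toy. It is not Duke's API, and Duke's own MatchListener interface may look quite different.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified callback (an assumption for this sketch, not Duke's interface).
interface SketchListener {
    void matches(String r1, String r2, double confidence);
}

// A listener that simply remembers every reported pair.
final class CollectingListener implements SketchListener {
    final List<String> found = new ArrayList<>();

    @Override
    public void matches(String r1, String r2, double confidence) {
        found.add(r1.trim() + " ~ " + r2.trim());
    }
}

final class DedupSketch {
    /** Compares every pair once and reports pairs at or above the threshold. */
    static void deduplicate(List<String> records, double threshold, SketchListener listener) {
        for (int i = 0; i < records.size(); i++) {
            for (int j = i + 1; j < records.size(); j++) {
                double confidence = similarity(records.get(i), records.get(j));
                if (confidence >= threshold) {
                    listener.matches(records.get(i), records.get(j), confidence);
                }
            }
        }
    }

    /** Toy similarity: 1.0 if equal ignoring case and surrounding whitespace, else 0.0. */
    static double similarity(String a, String b) {
        return a.trim().equalsIgnoreCase(b.trim()) ? 1.0 : 0.0;
    }
}
```

Calling `DedupSketch.deduplicate(records, 0.9, new CollectingListener())` on a small list would leave the matched pairs in the listener's `found` list. The all-pairs loop here is quadratic in the number of records and is only meant to show the callback pattern; an engine built on a search index such as Lucene would normally look up a small set of candidates per record instead.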
What's New in This Release:
New features:
  • Added longest common substring comparator.
  • LuceneDatabase now uses fuzzy search by default (which is much slower).
  • New default Record implementation, faster and uses less memory.
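
To give a feel for what a longest common substring comparison measures, here is a minimal Java sketch. It is not Duke's comparator: the class name `LongestCommonSubstringSketch` is hypothetical, and normalising by the length of the shorter string is an assumption made for this example.

```java
// Illustrative sketch only; not Duke's actual comparator implementation.
final class LongestCommonSubstringSketch {

    /** Length of the longest contiguous substring shared by a and b. */
    static int longestCommonSubstring(String a, String b) {
        int best = 0;
        // prev[j] = length of the common suffix of a[0..i-1) and b[0..j)
        int[] prev = new int[b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    curr[j] = prev[j - 1] + 1;
                    best = Math.max(best, curr[j]);
                }
            }
            prev = curr;
        }
        return best;
    }

    /** Similarity in [0, 1]: shared length divided by the shorter string's length (assumed normalisation). */
    static double similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.equals(b) ? 1.0 : 0.0;
        }
        return (double) longestCommonSubstring(a, b) / shorter;
    }
}
```

For instance, "abcdef" and "zcdez" share the substring "cde", so the length is 3 and the similarity under this normalisation is 3/5. A measure of this kind tolerates extra text before or after the shared part, which is why it can be useful for names and addresses that were entered with different prefixes or suffixes.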