A fast deduplication engine
At the moment, Duke can process 1,000,000 records in 11 minutes (roughly 1,500 records per second) on a standard laptop in a single thread.
Duke can be used to find duplicate records within a single table or data source, or to find records in different tables or sources that most likely represent the same real-world entity.
Duke is written in the Java programming language and it can be used on Mac OS X, Windows and Linux.
- High performance.
- Highly configurable.
- Support for CSV, JDBC, SPARQL, and NTriples DataSources.
- Many built-in comparators.
- Plug in your own data sources, comparators, and cleaners.
- Command-line client for getting started.
- API for embedding into any kind of application.
- Support for batch processing and continuous processing.
- Can maintain a database of the links it finds, accessed via JNDI/JDBC.
- Can run in multiple threads.
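To illustrate what plugging in a cleaner and a comparator might look like, here is a minimal sketch. The `Cleaner` and `ValueComparator` interfaces, and every class and method name below, are hypothetical stand-ins defined for this example only; they are not taken from Duke's API, so check the project's own documentation for the real interfaces.

```java
import java.util.Locale;

final class PluginSketch {
    // Hypothetical plug-in interfaces for illustration; not Duke's actual API.
    interface Cleaner {
        String clean(String value);
    }

    interface ValueComparator {
        // Returns a similarity between 0.0 (different) and 1.0 (identical).
        double compare(String v1, String v2);
    }

    // Normalizes a value: trims it, collapses runs of whitespace, lowercases it.
    static final class LowerCaseTrimCleaner implements Cleaner {
        @Override
        public String clean(String value) {
            return value.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
        }
    }

    // Simplest possible comparator: exact match or nothing.
    static final class ExactComparator implements ValueComparator {
        @Override
        public double compare(String v1, String v2) {
            return v1.equals(v2) ? 1.0 : 0.0;
        }
    }

    // Cleans both values first, then compares the cleaned forms.
    static double score(Cleaner cleaner, ValueComparator comparator, String a, String b) {
        return comparator.compare(cleaner.clean(a), cleaner.clean(b));
    }

    public static void main(String[] args) {
        double s = score(new LowerCaseTrimCleaner(), new ExactComparator(),
                "  ACME  Corp ", "acme corp");
        System.out.println(s);
    }
}
```

Separating cleaning from comparison keeps each piece small: the cleaner removes formatting noise, so the comparator only has to judge values that are already normalized.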
What's New in This Release:
- New features:
  - Added longest common substring comparator.
  - LuceneDatabase now uses fuzzy search by default (which is much slower).
  - New default Record implementation that is faster and uses less memory.
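As background for the longest common substring comparator mentioned above, the following is a minimal sketch of how such a similarity measure could be computed. It is not Duke's implementation: the class name, the method names, and the choice to normalize by the length of the shorter string are all assumptions made for this example.

```java
final class LongestCommonSubstringSketch {
    // Length of the longest run of characters that appears contiguously in both
    // strings, computed with the standard dynamic-programming recurrence.
    static int longestCommonSubstring(String a, String b) {
        int best = 0;
        int[] prev = new int[b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    // Extend the common run that ended at the previous characters.
                    curr[j] = prev[j - 1] + 1;
                    best = Math.max(best, curr[j]);
                }
            }
            prev = curr;
        }
        return best;
    }

    // Assumed normalization: common length divided by the shorter string's length,
    // giving a score between 0.0 and 1.0.
    static double similarity(String a, String b) {
        if (a.isEmpty() || b.isEmpty()) {
            return a.equals(b) ? 1.0 : 0.0;
        }
        return (double) longestCommonSubstring(a, b) / Math.min(a.length(), b.length());
    }

    public static void main(String[] args) {
        System.out.println(similarity("Oslo kommune", "kommune Oslo"));
    }
}
```

A measure of this kind rewards values that share a long contiguous fragment even when the surrounding text differs, which can help with names whose parts appear in a different order.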