Terrier Changelog
What's new in Terrier 3.6 (Apr 5, 2014)
- Indexing:
- TR-174: Indexing a directory breaks on special PDF or Excel files
- TR-178: Private method setPostingImplementation in BitPostingIndex
- TR-179: BitIn implementations should both inherit from a common base class
- TR-180: RandomDataInputMemory has an unnecessary binary search for every read()
- TR-181: Hadoop indexing should not copy hadoop libraries to a job classpath (contributed by Marco Didonna)
- TR-186: Inverted to Direct indexing is not exposed via the Trec_Terrier application
- TR-188: Stopwords incorrectly handles reset (contributed by Steven)
- TR-192: On Windows, the document.fsarrayfile is not closed, resulting in a _1 in the filename when indexing
- TR-194: TRECCollection docnos should be trimmed of whitespace
- TR-197: Terrier refuses to parse some topics (example included)
- TR-199: Block compression support (contributed by Benjamin Piwowarski)
- TR-200: Non unique keys in reverse index
- TR-201: Log4j conflicts can occur for hadoop indexing
- TR-202: Documentation for tokeniser property
- TR-205: Hadoop jar folder in distribution should not mention 0.20
- TR-206: Tag information is not loaded within BlockDirectIndex & BlockInvertedIndex (contributed by Sadi Samy)
- TR-207: Adhoc Evaluation returns bad precision at percent (contributed by Sadi Samy)
- TR-209: Allow long metaindex values to be cropped automatically by the MetaIndex
- TR-211: Indexer meta keys are case-sensitive, apart from docno
- TR-214: Indexing of metatags for XMLDocuments (contributed by Daniel Jimenez Kwast, Menno Tammens, Nicolas Faessel and Dennis Pallett)
- TR-216: Changing Hadoop temporary folder without recompiling
- TR-220: SimpleXMLCollection raises a null pointer exception if the document contains a doctype with the same name as xml.doctag (contributed by Nicolas Faessel)
- TR-235: LexiconBuilder fails on empty term
- TR-247: WARC09Collection and TRECWebCollection are not consistent about the return format for parseDate
- TR-252: Update Apache POI versions to parse newer Word/Excel/Powerpoint files
- TR-257: IndexUtil.rename() should check the return value of rename()
- TR-262: PostingIndex is the new DirectIndex and InvertedIndex abstract type
- TR-279: Termids should be assigned by decreasing frequency for highest direct file compression [single pass indexers]
- Retrieval:
- TR-170: ArrayIndexOutOfBoundsException in PostingListManager, add unit test for PostingListManager
- TR-173: The Decorate class incorrectly adds meta index properties when used as a PostProcess rather than a PostFilter
- TR-175: Decorate class does not remove field qualifiers when generating query-biased summaries
- TR-176: Allow arbitrary context objects in SearchRequest
- TR-185: TRECQuery should not tokenise the topic number
- TR-189: TRECFullTokenizer may discard DOCNO tag, causing Terrier to crash (contributed by Steven)
- TR-198: Conservative QE incorrectly weights queryterms (contributed by Saul Vargas)
- TR-203: MRF formula applies w_o twice
- TR-204: Relevance feedback for query expansion in queries without relevance judgements could throw NullPointerException
- TR-217: CS query expansion model is incorrect
- TR-228: ML2 and MDL2 are missing default constructors
- TR-229: BasicIterablePosting next(int)
- TR-230: Proximity operator ()
- TR-242: Problem with query terms frequency (key frequency = 1) using BM25
- TR-248: Error instantiating topic file QuerySource called TRECQuery
- TR-251: ResultSet implementations should know how to sort themselves
- TR-258: PhraseScoreModifier should use IterablePosting
- TR-259: FieldORIterablePosting doesn't save field lengths correctly
- TR-260: Block(Field)ORIterablePosting are inefficient
- TR-263: PostingListManager should retain the String of the term
- TR-264: Manager crops resultsets unnecessarily
- TR-266: TRECQuerying should support non textual OutputFormats
- TR-268: Query term counting doesn't work
- TR-276: Bump DFI models in core
- TR-278: The cache in WeightingModelFactory should be clearable
- Documentation:
- TR-195: Documentation should make clear that inverted files produced by different methods are identical
- TR-218: Documentation confuses block.size and blocks.size
- TR-223: Refactoring/Cleaning up of the package org.terrier.matching.models (contributed by Francois Rousseau)
- TR-224: configure_retrieval.html doesn't mention Dirichlet
- TR-227: Errors in Javadoc of In_expB2 and InB2 models
- TR-239: Clarify when Terrier Query language can be used viz TREC
- TR-269: Document default values for indexing.singlepass.max.postings.memory and indexing.max.docs.per.builder
- Other:
- TR-172: Upgrade PDFBox
- TR-210: InteractiveQuerying displays documents with scores of negative infinity
- TR-212: Evaluation doesn't support graded relevance judgements
- TR-219: TRECQrels has poor error messages
- TR-243: Terrier query language does not document multi-term field search syntax FIELD:(term1 term2)
- TR-253: Decorate & SimpleDecorate needs unit tests
- TR-254: Refactor query-biased summarisation out of Decorate
- TR-255: PostingListManager has no unit test
- TR-256: Files should check FileSystem return code for rename operation
- TR-261: CollectionStatistics should be Writable
- TR-264: Make TerrierTimer more useful
- TR-267: Get a clone of an EntryStatistics
- TR-270: Sorting of resultsets needs unit testing
- TR-271: CollectionStatistics should have a toString()
- TR-272: TerrierTimer is too verbose
- TR-273: DFR constituent models & PFN models should be Cloneable
- TR-274: Block Shakespeare tests are failing
- TR-275: TRECWebCollection doesn't normalise encodings
- TR-277: RandomDataInputMemory improvements
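
As a rough illustration of the reasoning behind TR-279 above (assigning termids by decreasing frequency to improve direct file compression), the following is a minimal, self-contained sketch. It is not Terrier code: the class TermIdAssignmentSketch, its methods and the toy documents are all hypothetical, and it assumes each document's termids are stored as Elias-gamma coded gaps. On this toy collection, giving the most frequent terms the smallest ids yields smaller gaps and fewer bits; the effect on a real collection depends on its term distribution.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names exist in Terrier.
public class TermIdAssignmentSketch {

    /** Bits needed by an Elias-gamma code for a positive integer x: 2*floor(log2 x) + 1. */
    static int gammaBits(int x) {
        if (x < 1) throw new IllegalArgumentException("gamma codes positive integers only");
        return 2 * (31 - Integer.numberOfLeadingZeros(x)) + 1;
    }

    /** Assigns ids 0..n-1 in order of first appearance in the collection. */
    static Map<String, Integer> idsByArrival(List<List<String>> docs) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        for (List<String> doc : docs)
            for (String term : doc)
                ids.putIfAbsent(term, ids.size());
        return ids;
    }

    /** Assigns ids 0..n-1 by decreasing document frequency (ties broken alphabetically). */
    static Map<String, Integer> idsByFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> doc : docs)
            for (String term : new HashSet<>(doc))
                df.merge(term, 1, Integer::sum);
        List<String> terms = new ArrayList<>(df.keySet());   // alphabetical order
        terms.sort((a, b) -> df.get(b) - df.get(a));          // stable sort keeps that order on ties
        Map<String, Integer> ids = new LinkedHashMap<>();
        for (String term : terms)
            ids.put(term, ids.size());
        return ids;
    }

    /** Total bits to gamma-code every document's sorted termids as gaps. */
    static int directFileBits(List<List<String>> docs, Map<String, Integer> ids) {
        int bits = 0;
        for (List<String> doc : docs) {
            TreeSet<Integer> sorted = new TreeSet<>();
            for (String term : doc)
                sorted.add(ids.get(term));
            int prev = -1;                                    // so that termid 0 is coded as gap 1
            for (int id : sorted) {
                bits += gammaBits(id - prev);
                prev = id;
            }
        }
        return bits;
    }

    /** Toy collection: rare terms happen to appear before the common ones. */
    static List<List<String>> sampleDocs() {
        return Arrays.asList(
            Arrays.asList("zebra", "quokka", "the"),
            Arrays.asList("axolotl", "the", "of"),
            Arrays.asList("the", "of", "index"),
            Arrays.asList("of", "the"));
    }

    public static void main(String[] args) {
        List<List<String>> docs = sampleDocs();
        System.out.println("arrival order:   " + directFileBits(docs, idsByArrival(docs)) + " bits");
        System.out.println("frequency order: " + directFileBits(docs, idsByFrequency(docs)) + " bits");
    }
}
```

In this sketch the frequency-ordered assignment needs fewer bits than the arrival-ordered one because the two terms shared by most documents receive ids 0 and 1, so most gaps are 1 and cost a single bit each.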
New in Terrier 3.5 (Jun 17, 2011)
- Indexing:
- TR-117: Improve fields support by SimpleXMLCollection
- TR-120: Error loading an additional MetaIndex structure (contributed by Javier Ortega, Universidad de Sevilla)
- TR-106: Pipeline Query/Doc Policy Lifecycle (contributed by Giovanni Stilo, University degli Studi dell'Aquila and Nestor Laboratory - University of Rome "Tor Vergata")
- TR-116: Lexicon not properly renamed on Windows
- TR-118: SimpleXMLCollection - the term near the closing tag is ignored (contributed by Damien Dudognon, Institut de Recherche en Informatique de Toulouse)
- TR-123: Null pointer exception while trying to index simple document (contributed by Ilya Bogunov)
- TR-126: Logging improvements
- TR-124: When processing docid tag in MEDLINE format XML file, xml context path is needed
- TR-127: Easier refactoring of SinglePass indexers (contributed by Jonathon Hare, University of Southampton)
- TR-108: Some indexers do not set the IterablePosting class for the DirectIndex (contributed by Richard Eckart de Castilho, Darmstadt University of Technology)
- TR-136: Hadoop indexing misbehaves when terrier.index.prefix is not "data"
- TR-137: TRECCollection cannot add properties from the document tags to the meta index at indexing time
- TR-150: TRECCollection should parse DOCHDR tags, including URLs should they exist (see TRECWebCollection)
- TR-138: IndexUtil.copyStructure fails when source and destination indices are same
- TR-140: Indexing support for query-biased summarisation
- TR-144: CollectionRecordReader.next should not be recursive
- TR-146, TR-148: Tokenisation should be done separately from Document parsing (the tokeniser can be set using the property tokeniser - see Non English language support in Terrier for more information on changing the tokenisation used by Terrier); Refactor Document implementations (e.g. TRECDocument and HTMLDocument are now deprecated in favour of the new TaggedDocument)
- TR-147: Allow various Collection implementations to use different Document implementations
- TR-158: Single pass indexing with default configuration doesn't ever flush memory
- Retrieval:
- TR-16, TR-166: Extending query language and Matching to support synonyms
- TR-157: Remove TRECQuerying scripting files: trec.models, qemodels, trec.topics.list and trec.qrels - use properties in TRECQuerying instead.
- TR-156: Deploy a DAAT matching strategy - see org.terrier.matching.daat (partially contributed by Nicola Tonellotto, CNR)
- TR-113: The LGD Loglogistic weighting model (contributed by Gianni Amati, FUB)
- TR-105: Index should check version number as it can't open older indices
- TR-107: DirectIndex.getTerms() is broken
- TR-110: TRECDocnoOutputFormat assumes metadata key is "docno"
- TR-112: "Term not found" log message should not be a warning
- TR-121: Distance.noTimesSameOrder() can throw ArrayIndexOutOfBoundsException
- TR-129: Posting.getDocumentLength() does not work for postings from the direct file
- TR-130: Manager should use Index specified in Request object
- TR-131: Parsing of WeightingModel class names could be better
- TR-132: Some BitIn implementations don't pass unit tests
- TR-139: Manager should balk at null Index in constructor
- TR-141: GammaFunction is not good enough for proximity - this fixes the retrieval effectiveness of DFRDependenceScoreModifier
- TR-142: Matching implementations should not overwrite the EntryStatistics stored in the MatchingQueryTerms object
- TR-143: BitFileBuffered creates unnecessary byte arrays
- TR-145: ResultSet implementations don't retain exactResultSize() in child ResultSets
- TR-149: Added first Divergence from Independence model
- TR-153, TR-154, TR-155: Provide a Matching implementation that reads results from TREC run files (see TRECResultsMatching)
- TR-160: Inv2DirectMultiReduce needs improvement to allow direct split across multiple files
- TR-161: Use Tokenisers in query side tokenisation
- TR-163: Index does not explicitly close the properties file
- TR-164: Document index structure is left open when index.close() is called
- TR-165: SingleLineTRECQuery opens all files as UTF
- TR-167: Large document metadata are stored incorrectly by MetaIndex
- Two new 2nd generation Divergence from Randomness models: JsKLs and XSqrA_M (contributed by Gianni Amati, Fondazione Ugo Bordoni)
- Testing:
- Added a considerable number of additional JUnit tests
- TR-134: BitPostingIndexInputFormat needs a unit test
- TR-135: TestPostingStructures should test skipping of stream structures
- TR-151: SimpleFileCollection and chums (FileDocument etc) have no unit test
- TR-159: JUnit end-to-end test for WT2G test collection
- Desktop:
- TR-103: Desktop search can't open files on 64-bit Windows
- Other:
- TR-168: Terrier batch scripts can fail when the TERRIER_HOME environment variable is set on 64-bit Windows
- TR-115: Upgrade Hadoop support for 0.20
- TR-104: Move to Java 6
- TR-119: Temporary jar/properties in HDFS /tmp are not deleted
- TR-152: TagSet should detect a tag in both process and skip entries
New in Terrier 3.0 (Mar 11, 2010)
- Indexing
- TR-14, TR-42, TR-56, TR-102: Various changes to the format of the index, to promote reuse, scalability and speed.
- TR-17, TR-50, TR-54, TR-77: Added MetaIndex for document metadata. DOCNOs etc. need not be in lexicographical order.
- TR-43, TR-48, TR-69, TR-70: Fields should contain frequency information.
- TR-39, TR-40, TR-41, TR-46, TR-50, TR-83, TR-88: Various improvements and bug fixes to MapReduce indexing.
- TR-44, TR-55: Improve robustness of single-pass indexing.
- TR-71, TR-98: Allow Bit posting structures to be split across multiple files.
- TR-28, TR-91: Index WARC collections (UK-2006, ClueWeb09).
- TR-34: Documentation update: Property values for single-pass indexing are not scaled.
- TR-37, TR-38, TR-47, TR-57, TR-78, TR-79, TR-93, TR-94: Generate the direct file from an inverted index as a MapReduce job.
- Retrieval
- TR-20, TR-42, TR-64: Access the posting list for one term as a stream - see Posting and IterablePosting.
- TR-86: Matching should be an interface.
- TR-87: PorterStemmer doesn't match expected output by Porter himself.
- TR-81: Implements proximity term dependence models. For more information, see Configuring Retrieval.
- TR-19: Support relevance feedback as well as pseudo-relevance feedback.
- TR-68, TR-73, TR-74, TR-94: Implement field-based weighting models. For more information, see Configuring Retrieval.
- TR-99: Provide way to integrate static doc prior easily. For more information, see Configuring Retrieval.
- TR-90: MatchingQueryTerms does not retain query term order.
- TR-26: Parse Million Query track topic files.
- TR-49: Let TRECQuerying filename be predetermined by property.
- TR-75: Allow the runtag to be set in runs.
- TR-60: Removed PonteCroft language modelling.
- TR-66, TR-84: Refactor TRECQuery.
- TR-67: Request object should contain the Index.
- TermScoreModifiers have been deprecated, and no longer work. You should use WeightingModel instead.
- Testing
- Added a considerable number of end-to-end and unit tests.
- TR-59: Fixed reset problem in Terrier evaluation tool.
- TR-76: Bump JUnit version.
- Desktop
- TR-61: Desktop example app should use MetaIndex.
- Other
- TR-89: Check all .java and .sh files have Terrier license header.
- TR-82: Have a simple webapps search results interface.
- TR-80: Move code to terrier.org Java package namespaces.
- TR-45: Add (read|write)(Delta|Golomb) etc to BitIn/BitOut.
- TR-52: FSOrderedMapFile causes seek(-1) when searching for an entry less than the first.
- TR-72: FSOrderedMapFile.EntryIterator.skip() breaks FSOrderedMapFile.EntryIterator.hasNext().
- TR-95: FSArrayFile.ArrayFileIterator.skip() does not update entry index correctly.
- TR-92: utility.io.CountingInputStream does not count single bytes correctly.
- TR-53: Rounding.toString() doesn't work for 10dp.
- TR-62: Files layer can transparently cache files.
- TR-2, TR-65, TR-97: Replace Terrier's Makefile with Ant build.xml.
- TR-63, TR-101: Documentation updates.
- TR-100: Update default and sample terrier.properties files.