Apache Lucene Changelog

New in Apache Lucene 9.10.0 (Feb 21, 2024)

  • New features:
  • Support for similarity-based vector searches, i.e., finding all nearest neighbors whose similarity to a query vector is greater than a configured threshold. See [Byte|Float]VectorSimilarityQuery.
  • Index sorting is now compatible with block joins. See IndexWriterConfig#setParentField.
  • MMapDirectory now uses the finalized JDK foreign memory API internally when running on Java 22 (or later). Previously this was supported only on Java 19 to 21.
  • SIMD vectorization now takes advantage of JDK vector incubator on Java 22. This was only supported with Java 20 or 21 until now.
  • Optimizations:
  • Tail postings are now encoded using group-varint. This yielded speedups on queries that match lots of terms that have short postings lists in Lucene's nightly benchmarks.
  • Range queries on points now exit earlier when evaluating a segment that has no matches. This will improve performance when intersected with other queries that have a high up-front cost such as multi-term queries.
  • BooleanQueries that mix SHOULD and FILTER clauses now propagate minimum competitive scores to the SHOULD clauses, yielding significant speedups for top-k queries sorted by descending score.
  • IndexSearcher#count has been optimized on pure disjunctions of two term queries.
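
To illustrate the group-varint item above, here is a minimal sketch of the general technique: four integers share one header byte that records, in two bits each, how many bytes every value occupies. The class and method names are hypothetical, and the byte layout is an assumption chosen for clarity; it is not claimed to match Lucene's on-disk format.

```java
import java.io.ByteArrayOutputStream;

/** Generic group-varint sketch (hypothetical names; not Lucene's exact layout). */
final class GroupVarintSketch {

    /** Number of bytes (1..4) needed to store v when treated as unsigned. */
    static int byteLength(int v) {
        if ((v & 0xFFFFFF00) == 0) return 1;
        if ((v & 0xFFFF0000) == 0) return 2;
        if ((v & 0xFF000000) == 0) return 3;
        return 4;
    }

    /** Encodes exactly four ints: one header byte, then the payload bytes. */
    static byte[] encodeGroup(int[] values) {
        if (values.length != 4) {
            throw new IllegalArgumentException("need exactly 4 values");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int header = 0;
        for (int i = 0; i < 4; i++) {
            // Two bits per value: (length in bytes - 1).
            header |= (byteLength(values[i]) - 1) << (i * 2);
        }
        out.write(header);
        for (int v : values) {
            int n = byteLength(v);
            for (int b = 0; b < n; b++) {
                out.write((v >>> (8 * b)) & 0xFF); // least significant byte first
            }
        }
        return out.toByteArray();
    }

    /** Decodes one group of four ints produced by encodeGroup. */
    static int[] decodeGroup(byte[] in) {
        int header = in[0] & 0xFF;
        int pos = 1;
        int[] values = new int[4];
        for (int i = 0; i < 4; i++) {
            int n = ((header >>> (i * 2)) & 0x3) + 1;
            int v = 0;
            for (int b = 0; b < n; b++) {
                v |= (in[pos++] & 0xFF) << (8 * b);
            }
            values[i] = v;
        }
        return values;
    }
}
```

Because the lengths of all four values are known after reading a single header byte, a decoder avoids the per-byte continuation-bit checks of plain varint encoding, which is the usual motivation for this scheme.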

New in Apache Lucene 9.9.2 (Jan 30, 2024)

  • Bug fixes:
  • Fix NPE when sampling for quantization in Lucene99HnswScalarQuantizedVectorsFormat (Ben Trent)
  • Rollback the tmp storage of BytesRefHash to -1 after sort (Guo Feng)

New in Apache Lucene 9.9.1 (Dec 17, 2023)

  • Bug fixes:
  • JVM SIGSEGV crash when compiling computeCommonPrefixLengthAndBuildHistogram (Chris Hegarty)
  • Push and pop OutputAccumulator as IntersectTermsEnumFrames are pushed and popped (Guo Feng, Mike McCandless)

New in Apache Lucene 9.9.0 (Dec 5, 2023)

  • New Features:
  • Add int8 scalar quantization to the HNSW vector format. This optionally allows for more compact lossy storage for the vectors, requiring approximately 4x less memory for fast HNSW search.
  • HNSW graph now can be merged with multiple threads, leveraging the same infrastructure that inter-segment concurrency utilizes.
  • Improvements:
  • Speed up Panama vector support, use FMA, and test improvements.
  • FSTCompiler can now approximately limit how much RAM it uses to share suffixes during FST construction using the suffixRAMLimitMB method.
  • Optimizations:
  • Faster top-level conjunctions on term queries when sorting by descending score.
  • Change Postings back to using FOR in Lucene99PostingsFormat. Freqs, positions and offset keep using PFOR.
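
The scalar quantization item above can be pictured with a small sketch. It assumes a simple linear mapping of each float component from a known [min, max] range onto 0..127, stored in one byte instead of four; the class name, the range handling, and the rounding are illustrative assumptions rather than a description of Lucene's quantizer.

```java
/** Linear scalar quantization sketch (hypothetical; not Lucene's implementation). */
final class ScalarQuantizationSketch {

    /** Maps each component from [min, max] onto 0..127, clamping out-of-range values. */
    static byte[] quantize(float[] vector, float min, float max) {
        float scale = 127f / (max - min);
        byte[] out = new byte[vector.length];
        for (int i = 0; i < vector.length; i++) {
            float clamped = Math.max(min, Math.min(max, vector[i]));
            out[i] = (byte) Math.round((clamped - min) * scale);
        }
        return out;
    }

    /** Approximate inverse of quantize; the result is lossy. */
    static float[] dequantize(byte[] quantized, float min, float max) {
        float step = (max - min) / 127f;
        float[] out = new float[quantized.length];
        for (int i = 0; i < quantized.length; i++) {
            out[i] = min + quantized[i] * step;
        }
        return out;
    }
}
```

Storing one byte per dimension instead of a 4-byte float is where the roughly 4x memory reduction mentioned above comes from; the price is a per-component error of at most half a quantization step for values inside the range.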

New in Apache Lucene 9.8.0 (Sep 29, 2023)

  • Optimizations:
  • Faster computation of top-k hits on boolean queries. Lucene's nightly benchmarks report a 20%-30% speedup for disjunctive queries and an 11%-13% speedup for conjunctive queries since Lucene 9.7. Disjunctive queries with many and/or high-frequency terms should see even higher speedups.
  • Faster computation of top-k hits when sorting by field. Lucene's nightly benchmarks report speedups between 7% and 33% since 9.7 depending on the type and cardinality of the field that is used for sorting.
  • Faster indexing of numeric doc values when index sorting is turned on.
  • Expressions now evaluate all arguments in a fully lazy manner, which may provide significant speedups and throughput improvements for heavy expression users.
  • API Changes:
  • Move max vector dims limit to Codec (Mayya Sharipova)
  • New features:
  • Introduced LeafCollector#finish, a hook that runs after collection has finished running on a leaf.
  • Add "KnnCollector" to "LeafReader" and "KnnVectorReader" so that custom collection of vector search results can be provided. The first custom collector provides "ToParentBlockJoin[Float|Byte]KnnVectorQuery" joining child vector documents with their parent documents.
  • Add support for recursive graph bisection, also called bipartite graph partitioning, and often abbreviated BP, an algorithm for reordering doc IDs that results in more compact postings and faster queries, especially conjunctions.
  • Bug fixes:
  • Fix HNSW graph search bug that potentially leaked unapproved docs
  • Fix bug in TermsEnum#seekCeil on doc values terms enums that causes IndexOutOfBoundsException.

New in Apache Lucene 9.7.0 (Jun 26, 2023)

  • New features:
  • The new IndexWriter#updateDocuments(Query, Iterable) allows updating multiple documents that match a query at the same time.
  • Function queries can now compute similarity scores between kNN vectors.
  • Optimizations:
  • KNN indexing and querying can now take advantage of vectorization for distance computation between vectors. To enable this, use exactly Java 20 or 21, and pass --add-modules jdk.incubator.vector as a command-line parameter to the Java program.
  • KNN queries now run concurrently if the IndexSearcher has been created with an executor.
  • Queries sorted by field are now able to dynamically prune hits only using the after value. This yields major speedups when paginating deeply.
  • Reduced merge-time overhead of computing the number of soft deletes.
  • Changes in runtime behavior:
  • KNN vectors are now disallowed to have non-finite values such as NaN or ±Infinity.
  • Bug fixes:
  • Backward reading is no longer an adversarial case for BufferedIndexInput, used by NIOFSDirectory and SimpleFSDirectory. This addresses a performance bug when performing terms dictionary lookups with either of these directories.
  • GraphTokenStreamFiniteStrings#articulationPointsRecurse may no longer overflow the stack.
  • ... plus a number of helpful bug fixes!
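
Since non-finite vector values are now rejected (see the runtime behavior change above), applications may want to validate vectors before handing them to the indexer. The helper below is a hypothetical pre-check written for illustration; the exception type and message are assumptions and do not mirror what Lucene itself throws.

```java
/** Hypothetical application-side validation of a vector before indexing. */
final class FiniteVectorCheck {

    /** Throws if any component is NaN, +Infinity or -Infinity. */
    static void checkFinite(float[] vector) {
        for (int i = 0; i < vector.length; i++) {
            if (!Float.isFinite(vector[i])) {
                throw new IllegalArgumentException(
                    "non-finite value at dimension " + i + ": " + vector[i]);
            }
        }
    }
}
```

Running such a check at ingestion time makes it easier to trace a bad vector back to its source than discovering the problem when the document is added to the index.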

New in Apache Lucene 9.6.0 (May 10, 2023)

  • Introduce a new KeywordField for simple and efficient filtering, sorting and faceting.
  • Add support for Java 20 foreign memory API. If exactly Java 19 or 20 is used, MMapDirectory will mmap Lucene indexes in chunks of 16 GiB (instead of 1 GiB) and indexes closed while queries are running can no longer crash the JVM.
  • Improved performance for TermInSetQuery, PrefixQuery, WildcardQuery and TermRangeQuery
  • Lower memory usage for BloomFilteringPostingsFormat
  • Faster merges for HNSW indexes
  • Improvements to concurrent indexing throughput under heavy load
  • Correct equals implementation in SynonymQuery
  • 'explain' is now implemented on TermAutomatonQuery

New in Apache Lucene 9.5.0 (Jan 30, 2023)

  • New features:
  • Added KnnByteVectorField and ByteVectorQuery that are specialized for indexing and querying byte-sized vectors. Deprecated KnnVectorField, KnnVectorQuery and LeafReader#getVectorValues in favour of the newly introduced KnnFloatVectorField, KnnFloatVectorQuery and LeafReader#getFloatVectorValues that are specialized for float vectors.
  • Added IntField, LongField, FloatField and DoubleField: easy to use numeric fields that perform well both for filtering and sorting.
  • Support for Java 19 foreign memory access ("project Panama") was enabled by default removing the need to provide the "--enable-preview" flag.
  • Added ByteWritesTrackingDirectoryWrapper to expose metrics for bytes merged, flushed, and overall write amplification factor.
  • Optimizations:
  • Improved storage efficiency of connections in the HNSW graph used for vector search
  • Added new stored fields and term vectors interfaces: IndexReader#storedFields and IndexReader#termVectors. These do not rely upon ThreadLocal storage for each index segment, which can greatly reduce RAM requirements when there are many threads and/or segments.
  • Several improvements were made to IndexSortSortedNumericDocValuesRangeQuery including query execution optimization with points for descending sorts and BoundedDocIdSetIterator construction sped up using bkd binary search.
  • Other:
  • Moved DocValuesNumbersQuery from sandbox to NumericDocValuesField#newSlowSetQuery
  • Fix exponential runtime for nested BooleanQuery#rewrite with non scoring clauses

New in Apache Lucene 9.4.2 (Nov 27, 2022)

  • Fixed integer overflow when opening segments containing more than ~16M KNN vectors.
  • Fixed cost computation of BitSets created via DocIdSetBuilder, such as for multi-term queries. This may improve performance of multi-term queries.
  • CheckIndex now verifies the consistency of KNN vectors more thoroughly.

New in Apache Lucene 9.4.1 (Oct 24, 2022)

  • When reading large segments, the kNN vectors format could fail with a validation error, preventing further writes or searches on the index. This bug is now fixed. Only version 9.4.0 was affected, so it is recommended to skip 9.4.0 if you are using kNN vectors.

New in Apache Lucene 9.4.0 (Oct 2, 2022)

  • New features:
  • Added ShapeDocValues/Field, a unified abstraction to represent existing types: XY and lat/long.
  • FacetSets can now be filtered using a Query via MatchingFacetSetCounts.
  • SortField now allows control over whether to apply index-sort optimizations.
  • Support for Java 19 foreign memory access ("project Panama") was added. Applications started with command line parameter "java --enable-preview" will automatically use the new foreign memory API of Java 19 to access indexes on disk with MMapDirectory. This is an opt-in feature and requires an explicit Java command line flag passed to your application's Java process (e.g., modify the startup parameters of Solr or Elasticsearch/OpenSearch). When enabled, Lucene logs a notice using java.util.logging. Please test thoroughly and report bugs/slowness to Lucene's mailing list. When the new API is used, MMapDirectory will mmap Lucene indexes in chunks of 16 GiB (instead of 1 GiB) and indexes closed while queries are running can no longer crash the JVM.
  • Optimizations:
  • Added support for dynamic pruning to queries sorted by a string field that is indexed with both terms and SORTED or SORTED_SET doc values. This can lead to dramatic speedups when applicable.
  • TermInSetQuery is optimized for the case when one of its terms matches all docs in a segment, and it now provides cost estimation, making it usable with IndexOrDocValuesQuery for better query planning.
  • KnnVector fields can now be stored with reduced (8-bit) precision, saving storage and yielding a small query latency improvement.
  • Other:
  • KnnVector fields' HNSW graphs are now created incrementally when new documents are added, rather than all at once when flushing. This yields more consistent, predictable behavior at the cost of an overall increase in indexing time.
  • randomizedtesting dependency upgraded to 2.8.1
  • addIndexes(CodecReader) now respects MergePolicy and MergeScheduler, enabling it to do its work concurrently.

New in Apache Lucene 9.3.0 (Jul 31, 2022)

  • Merge on full flush is enabled now by default with a timeout of 500ms, giving the merge policy a chance to merge NRT segments together before publishing a new point-in-time view of the IndexReader. This should give queries a small performance boost in the near-realtime case, especially terms-dictionary-intensive queries like fuzzy queries.
  • Add getAllChildren functionality to facets.
  • Added facetsets module for high dimensional (hyper-rectangle) faceting.
  • Top-level two-clause disjunctions sorted by score now use the block-max MAXSCORE algorithm, which introduced a 40%-75% speedup in our benchmarks.
  • BooleanQuery can return quick counts for simple boolean queries.
  • When running KnnVectorQuery with a filter, reuse the cached filter bit set.

New in Apache Lucene 9.2.0 (May 25, 2022)

  • Numerous improvements to indexing and query performance for KNN vectors
  • More efficient implementations for count operations on range queries
  • A new FieldExistsQuery that chooses the best index structures to run over for you
  • A new Persian stemmer

New in Apache Lucene 9.1.0 (Mar 22, 2022)

  • New features:
  • Lucene JARs are now proper Java modules, with module descriptors and dependency information
  • Support for filtering in nearest-neighbor vector search
  • Support for intervals queries in the standard query syntax
  • A new token filter SpanishPluralStemFilter for precise stemming of Spanish plurals
  • Optimizations:
  • Up to 30% improvement in index throughput for high-dimensional vectors
  • Up to 10% faster nearest neighbor searches on high-dimensional vectors
  • Faster execution of "count" searches across different query types
  • Faster counting for taxonomy facets
  • Several other search speed-ups, including improvements to PointRangeQuery, MultiRangeQuery, and CoveringRangeQuery
  • Other:
  • The test framework is now a module, so all classes have been moved to org.apache.lucene.tests.* to avoid package name conflicts
  • Lucene now faithfully implements the HNSW algorithm for nearest neighbor search by supporting multiple graph layers

New in Apache Lucene 9.0.0 (Dec 8, 2021)

  • System requirements:
  • Lucene 9.0 requires JDK 11 or newer
  • New features:
  • Support for indexing high-dimensionality numeric vectors to perform nearest-neighbor search, using the Hierarchical Navigable Small World graph algorithm
  • New Analyzers for Serbian, Nepali, and Tamil languages
  • IME-friendly autosuggest for Japanese
  • Snowball 2, adding Hindi, Indonesian, Nepali, Serbian, Tamil, and Yiddish stemmers
  • New normalization/stemming for Swedish and Norwegian
  • Optimizations:
  • Up to 400% faster taxonomy faceting
  • 10-15% faster indexing of multi-dimensional points
  • Several times faster sorting on fields that are indexed with points. This optimization used to be an opt-in in late 8.x releases and is now opt-out as of 9.0.
  • ConcurrentMergeScheduler now assumes fast I/O, likely improving indexing speed in cases where heuristics would incorrectly detect whether the system had modern I/O or not
  • Encoding of postings lists changed from FOR-delta to PFOR-delta to save further disk space
  • Other:
  • File formats have all been changed from big-endian order to little endian order
  • Lucene 9 no longer has split packages. This required renaming some packages outside of the lucene-core JAR, so you will need to adjust some imports accordingly.
  • Using Lucene 9 with the module system should be considered experimental. We expect to make progress on this in future 9.x releases.
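
The byte-order change noted above concerns how multi-byte values are laid out in index files. The sketch below uses only the JDK's ByteBuffer to show the same int serialized in both orders; the class name is hypothetical and the code does not touch any Lucene file format.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Shows how one int is laid out in big-endian versus little-endian order. */
final class EndiannessSketch {

    /** Serializes a single int into 4 bytes using the given byte order. */
    static byte[] encodeInt(int value, ByteOrder order) {
        return ByteBuffer.allocate(Integer.BYTES).order(order).putInt(value).array();
    }
}
```

For the value 0x01020304, big-endian order stores the bytes 01 02 03 04 while little-endian order stores 04 03 02 01. Little-endian matches the native order of most current CPUs, so readers can avoid byte swapping when decoding.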

New in Apache Lucene 8.11.0 (Nov 17, 2021)

  • Facets now properly ignore deleted documents when accumulating facet counts for all documents.
  • CheckIndex can run concurrently.

New in Apache Lucene 8.10.1 (Oct 18, 2021)

  • Bug fixes:
  • MultiCollector now handles a single leaf collector that wants to skip low-scoring hits when the combined score mode doesn't allow it.
  • Fix for sort optimization with search_after that was wrongly skipping documents whose values are equal to the last value of the previous page.
  • Fix for sort optimization with a chunked bulk scorer that was wrongly skipping documents.

New in Apache Lucene 8.10.0 (Sep 30, 2021)

  • New features:
  • Multi-valued fields are now supported in numeric range facet counting
  • Added new analyzer for Telugu
  • Near-real-time readers opened from an IndexCommit can now sort their leaves
  • SimpleText codec now implements skipping for its postings lists
  • Optimizations:
  • Performance improvements for faceting, including a new protected API to control which fields are counted for drill-down during drill sideways, and optimized drill sideways iterating
  • RegexpQuery's detection of adversarial (ReDoS) regular expressions is improved, catching exotic cases that it missed before, and throwing TooComplexToDeterminizeException
  • Speedup for computing the leading prefix and trailing suffix from an Automaton, and for managing powersets during determinize
  • Speedups for stored fields retrieval with the default codec (BEST_SPEED)
  • IndexWriter uses less RAM when buffering documents, especially in the case of many unique fields
  • forceMerge will now merge any number of segments at once, making it much faster in many cases
  • Compression improvements for docvalues storage

New in Apache Lucene 8.9.0 (Jun 18, 2021)

  • Compression was added to SortedSet DocValues, significantly reducing their size on disk.
  • BM25FQuery was extended to handle similarities beyond BM25Similarity. It was renamed to CombinedFieldQuery to reflect its more general scope.
  • A new PatternTypingFilter was added to allow setting a type attribute on tokens based on a configured set of regular expressions.
  • An option was added to supply a custom leaf sorter for IndexWriter and DirectoryReader, which can speed up sort queries with a provided sort criterion.

New in Apache Lucene 8.8.2 (Apr 13, 2021)

  • LUCENE-9870: Fix Circle2D intersectsLine t-value (distance) range clamp
  • LUCENE-9744: NPE on a degenerate query in MinimumShouldMatchIntervalsSource$MinimumMatchesIterator.getSubMatches().
  • LUCENE-9762: DoubleValuesSource.fromQuery (also used by FunctionScoreQuery.boostByQuery) could throw an exception when the query implements TwoPhaseIterator and when the score is requested repeatedly

New in Apache Lucene 8.8.1 (Feb 24, 2021)

  • New Features:
  • LUCENE-9552: New LatLonPoint query that accepts an array of LatLonGeometries.
  • LUCENE-9641: LatLonPoint query support for spatial relationships.
  • LUCENE-9553: New XYPoint query that accepts an array of XYGeometries.
  • LUCENE-9594: FeatureField supports newLinearQuery that for scoring uses raw indexed values of features without any transformation.
  • LUCENE-9378: Doc values now allow configuring how to trade compression for retrieval speed.
  • LUCENE-9413: Add CJKWidthCharFilter and its factory
  • Improvements:
  • LUCENE-9455: ExitableTermsEnum should sample timeout and interruption check before calling next().
  • LUCENE-9023: GlobalOrdinalsWithScore should not compute occurrences when the provided min is 1.
  • LUCENE-9675: Binary doc values fields now expose their configured compression mode in the attributes of the field info.
  • Optimizations:
  • LUCENE-9536: Reduced memory usage for OrdinalMap when a segment has all values.
  • LUCENE-9021: QueryParser: re-use the LookaheadSuccess exception.
  • LUCENE-9636: Faster decoding of postings for some numbers of bits per value.
  • LUCENE-9346: WANDScorer now supports queries that have a `minimumNumberShouldMatch` configured.
  • Bug Fixes:
  • LUCENE-9508: DocumentsWriter was only stalling threads for 1 second, allowing documents to be indexed even when the DocumentsWriter wasn't able to keep up with flushing. Unless IW can't make progress due to an ill-behaving DWPT, this issue was barely noticeable.
  • LUCENE-9581: Japanese tokenizer should discard the compound token instead of disabling the decomposition of long tokens when discardCompoundToken is activated.
  • LUCENE-9595: Make Component2D#withinPoint implementations consistent with ShapeQuery logic.
  • LUCENE-9606: Wrap boolean queries generated by shape fields with a Constant score query.
  • LUCENE-9635: BM25FQuery - Mask encoded norm long value in array lookup.
  • LUCENE-9617: Fix per-field memory leak in IndexWriter.deleteAll(). Reset next available internal field number to 0 on FieldInfos.clear(), to avoid wasting FieldInfo references.
  • LUCENE-9642: When encoding triangles in ShapeField, make sure generated triangles are CCW by rotating triangle points before checking triangle orientation.
  • LUCENE-9661: Fix deadlock in TermsEnum.EMPTY that occurs when trying to initialize TermsEnum and BaseTermsEnum at the same time
  • Other:
  • SOLR-14995: Update Jetty to 9.4.34
  • LUCENE-9637: Removes some unused code and replaces the Point implementation on ShapeField/ShapeQuery random tests.

New in Apache Lucene 8.8.0 (Feb 2, 2021)

  • LatLonPoint query that accepts an array of LatLonGeometries, with support for spatial relationships
  • XYPoint query that accepts an array of XYGeometries
  • Doc values now allow configuring how to trade compression for retrieval speed

New in Apache Lucene 8.7.0 (Nov 4, 2020)

  • Better compression of stored fields. Stored fields now use dictionaries in order to improve the compression ratio when there is a lot of redundancy across documents. This works for both the BEST_SPEED and the BEST_COMPRESSION modes.
  • Faster sorting by field. When a doc-value field is also indexed with points, Lucene now takes advantage of this points index in order to skip documents whose sort value is not competitive.
  • Faster flushing of stored fields when index sorting is enabled. This can significantly speed up indexing when a non-negligible amount of data is stored in the index and index sorting is enabled.

New in Apache Lucene 8.6.3 (Oct 9, 2020)

  • This release contains no additional bug fixes over the previous version 8.6.2

New in Apache Lucene 8.6.2 (Sep 1, 2020)

  • Bug Fixes:
  • LUCENE-9478: IndexWriter leaked about 500 bytes of heap space for each full-flush, getReader or commit. This was a regression in 8.6.0

New in Apache Lucene 8.6.1 (Aug 16, 2020)

  • LUCENE-9443: The UnifiedHighlighter was closing the underlying reader when there were multiple term-vector fields.

New in Apache Lucene 8.6.0 (Jul 17, 2020)

  • API change in: SimpleFSDirectory, IndexWriterConfig, MergeScheduler, SortFields, SimpleBindings, QueryVisitor, DocValues, CodecUtil.
  • New: IndexWriter merge-on-commit feature to selectively merge small segments on commit, subject to a configurable timeout, to improve search performance by reducing the number of small segments for searching.
  • New: Grouping by range based on DoubleValuesSource and LongValuesSource.
  • Optimizations: BKD trees and index, DoubleValuesSource/QueryValueSource, UsageTrackingQueryCachingPolicy, FST, Geometry queries, Points, UniformSplit.
  • Others: Ukrainian analyzer, checksums verification, resource leaks fixes.

New in Apache Lucene 8.5.2 (May 27, 2020)

  • Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

New in Apache Lucene 8.5.1 (Apr 16, 2020)

  • Bug Fixes:
  • LUCENE-9300: Index corruption with doc values updates and addIndexes.

New in Apache Lucene 8.5.0 (Mar 26, 2020)

  • XYPointField allows you to index points in flat X,Y space and efficiently find documents that fall within a bounding box, distance or arbitrary polygon
  • New query builders on LatLonShape allow you to efficiently find documents with a specific relation to a point or polygon
  • You can now store up to 16 data dimensions in a Point field
  • KoreanTokenizer supports custom dictionaries
  • Binary doc values are now compressed, and term dictionaries have improved compression
  • Index flushes are up to 20% faster if all docvalues updates are updating a single field to the same value
  • The index of stored fields and term vectors is now stored off-heap
  • Query parsers based on QueryBuilder can boost particular terms or synonyms by setting BoostAttribute values on a token stream
  • Intervals queries correctly handle repeated subterms in ordered and unordered sources

New in Apache Lucene 8.4.0 (Dec 30, 2019)

  • LatLonShape now supports the "CONTAINS" relation, which makes it possible to find all indexed shapes that contain the query shape.
  • Concurrent search is getting more efficient by allowing collectors to share information across threads in order to more efficiently skip non-competitive hits.
  • Faster FST lookups on dense nodes.
  • Postings are now decoded using SIMD instructions.
  • LRUQueryCache includes new heuristics that prevent caching from hurting latency too much.
  • LatLonShape builds a more efficient tree that is expected to translate into search speed improvements.
  • BaseDirectoryReader no longer sums up document counts across leaves eagerly, allowing for more efficient reader views that hide a subset of documents.
  • The index on top of BKD trees is now stored off-heap with MMapDirectory.
  • Simple intervals queries now support highlighting.
  • Reading DocValues can be interrupted when timeout is exceeded.

New in Apache Lucene 8.3.1 (Dec 5, 2019)

  • Bugfix: MultiTermIntervalsSource.visit() was not calling back to its visitor

New in Apache Lucene 8.3.0 (Nov 4, 2019)

  • New SpanishMinimalStemFilter
  • New "export all terms and doc freqs" feature to Luke with delimiters
  • Composite Matches from multiple subqueries now allow access to their submatches, and a new NamedMatches API allows marking of subqueries and a simple way to find which subqueries have matched on a given document
  • Range Query For Multiple Connected Ranges
  • LatLonDocValuesPointInPolygonQuery for LatLonDocValuesField
  • New UniformSplitPostingsFormat (name "UniformSplit") primarily benefiting in simplicity and extensibility
  • New STUniformSplitPostingsFormat (name "SharedTermsUniformSplit") that shares a single internal term dictionary across fields
  • DisjunctionMaxQuery more efficiently leverages impacts to skip non-competitive hits
  • BooleanQuery with no scoring clause can now early terminate the query when the total hits is not requested
  • Matches on wildcard queries will defer building their full disjunction until a MatchesIterator is pulled
  • spatial-extras quad and packed quad prefix trees now index points faster
  • Add additional leaf node level optimizations in LatLonShapeBoundingBoxQuery
  • Improve performance of WITHIN and DISJOINT queries for Shape queries by doing just one pass whenever possible
  • Introduce shared count based early termination across multiple slices
  • Blocktree's seekExact now short-circuits false if the term isn't in the min-max range of the segment. Large perf gain for ID/time like data when populated sequentially
  • Show SPI names instead of class names in Luke Analysis tab
  • GraphTokenStreamFiniteStrings preserves all Token attributes through its finite strings TokenStreams
  • Introduced SpanPositionRange into XML Query Parser
  • Use a sort key instead of true distance in NearestNeighbor
  • Tessellator labels the edges of the generated triangles whether they belong to the original polygon
  • Use exact distance between point and bounding rectangle in FloatPointNearestNeighbor
  • The Korean analyzer now splits tokens on boundaries between digits and alphabetic characters
  • MoreLikeThis is biased for uncommon fields
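
The seekExact short-circuit described above can be sketched as a range pre-check: if a segment records its smallest and largest term, any lookup outside that range can return false without touching the terms dictionary. The class below is a hypothetical stand-in using plain byte arrays and unsigned lexicographic comparison; it is not Lucene's BlockTree code.

```java
import java.util.Arrays;

/** Hypothetical per-segment min/max term filter used before an exact term lookup. */
final class MinMaxTermFilter {
    private final byte[] minTerm;
    private final byte[] maxTerm;

    MinMaxTermFilter(byte[] minTerm, byte[] maxTerm) {
        this.minTerm = minTerm.clone();
        this.maxTerm = maxTerm.clone();
    }

    /** Returns false when the term cannot be in the segment, so the lookup can be skipped. */
    boolean mayContain(byte[] term) {
        return Arrays.compareUnsigned(term, minTerm) >= 0
            && Arrays.compareUnsigned(term, maxTerm) <= 0;
    }
}
```

With sequentially assigned IDs or timestamps, each segment covers a narrow, mostly disjoint term range, so most segments reject a given key with two array comparisons. That is consistent with the large gain reported above for ID-like and time-like data.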

New in Apache Lucene 8.2.0 (Jul 26, 2019)

  • API Changes:
  • Intervals queries have been moved from the sandbox to the queries module.
  • New Features:
  • New XYShape Field and Queries for indexing and querying general cartesian geometries.
  • Snowball stemmer/analyzer for the Estonian language.
  • Provide a FeatureSortField to allow sorting search hits by descending value of a feature.
  • Add new KoreanNumberFilter that can change Hangul character to number and process decimal point.
  • Add doc-value support to range fields.
  • Add monitor subproject (previously Luwak monitoring library) that allows a stream of documents to be matched against a set of registered queries in an efficient manner.
  • Add a numeric range query in sandbox that takes advantage of index sorting.
  • Optimizations:
  • Use exponential search instead of binary search in IntArrayDocIdSet#advance method.
  • Use incoming thread for execution if IndexSearcher has an executor. Now caller threads execute at least one search on an index even if there is an executor provided to minimize thread context switching.
  • New storage strategy for BKD tree leaves with low cardinality that can lower storage costs and can be used at search time to speed up queries.
  • Load frequencies lazily only when needed in BlockDocsEnum and BlockImpactsEverythingEnum.
  • Phrase queries now leverage impacts.
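
The exponential search optimization above targets "advance" calls, where the next matching doc ID is usually close to the current position. The sketch below is a generic galloping search over a sorted int array, with hypothetical names; it is meant to show the technique and is not the IntArrayDocIdSet code.

```java
/** Exponential (galloping) search sketch for advancing in a sorted doc ID array. */
final class ExponentialSearchSketch {

    /** Returns the first index i >= from with docs[i] >= target, or docs.length if none. */
    static int advance(int[] docs, int from, int target) {
        int n = docs.length;
        if (from >= n) return n;
        if (docs[from] >= target) return from;

        // Invariant from here on: docs[lo] < target.
        int lo = from;
        long step = 1; // long avoids overflow while doubling
        while (lo + step < n && docs[(int) (lo + step)] < target) {
            lo += step;
            step <<= 1;
        }
        // Either hi == n or docs[hi] >= target.
        int hi = (int) Math.min(n, lo + step);

        // Binary search inside the bracket (lo, hi].
        while (hi - lo > 1) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        return hi;
    }
}
```

The cost grows with the logarithm of the distance to the target rather than the logarithm of the array length, which favors the short forward jumps that are typical when iterators are intersected.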

New in Apache Lucene 8.1.1 (May 29, 2019)

  • This release contains no change over 8.1.0.

New in Apache Lucene 8.1.0 (May 16, 2019)

  • A query introspection API has been introduced.
  • Luke, the well-known GUI for inspecting Lucene indexes, is now included as a Lucene module
  • Merging dimensional points to use radix partitioning, which has also been optimized
  • Bugfix: LatLonShapePolygonQuery returns incorrect WITHIN results with shared boundaries
  • TieredMergePolicy#findForcedMerges now tries to create the cheapest merges
  • Build point writers in the BKD tree only when they are needed
  • SynonymQuery can now deboost the document frequency of each term when blending synonym scores
  • ConstantScoreQuery can early terminate if minimum score > constant score (total hits are not requested)
  • DateRangePrefixTree can now parse more precise dates

New in Apache Lucene 8.0.0 (Mar 18, 2019)

  • Query execution:
  • Term queries, phrase queries and boolean queries introduced a new optimization that enables efficient skipping over non-competitive documents when the total hit count is not needed. Depending on the exact query and data distribution, queries might run anywhere from a few percent slower to many times faster, especially term queries and pure disjunctions.
  • In order to support this enhancement, some API changes have been made:
      • TopDocs.totalHits is no longer a long but an object that gives a lower bound of the actual hit count.
      • IndexSearcher's search and searchAfter methods now only compute total hit counts accurately up to 1,000 in order to enable this optimization by default.
      • Queries are now required to produce non-negative scores.
  • Codecs:
  • Postings now index score impacts alongside skip data. This is how term queries optimize collection of top hits when hit counts are not needed.
  • Doc values introduced jump tables, so that advancing runs in constant time. This is especially helpful on sparse fields.
  • The terms index FST is now loaded off-heap for non-primary-key fields using MMapDirectory, reducing heap usage for such fields.
  • Custom scoring:
  • The new FeatureField allows efficient integration of static features such as a pagerank into the score. Furthermore, the new LongPoint#newDistanceFeatureQuery and LatLonPoint#newDistanceFeatureQuery methods allow boosting by recency and geo-distance respectively. These new helpers are optimized for the case when total hit counts are not needed. For instance if the pagerank has a significant weight in your scores, then Lucene might be able to skip over documents that have a low pagerank value.
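
The idea of a hit count that is only a lower bound, mentioned in the query execution notes above, can be modeled with a small sketch. All names here are hypothetical stand-ins rather than Lucene's classes, and the loop simply stops counting past a threshold; the real optimization skips non-competitive documents instead of scanning them.

```java
import java.util.function.IntPredicate;

/** Hypothetical model of a hit count that is exact only up to a threshold. */
final class HitCountSketch {

    enum Relation { EQUAL_TO, GREATER_THAN_OR_EQUAL_TO }

    /** A count plus a flag saying whether it is exact or a lower bound. */
    static final class TotalHitsEstimate {
        final long value;
        final Relation relation;

        TotalHitsEstimate(long value, Relation relation) {
            this.value = value;
            this.relation = relation;
        }
    }

    /** Counts matches exactly up to threshold, then reports a lower bound. */
    static TotalHitsEstimate count(IntPredicate matches, int maxDoc, int threshold) {
        long count = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            if (matches.test(doc)) {
                count++;
                if (count > threshold) {
                    // Past the threshold the caller only learns "at least this many".
                    return new TotalHitsEstimate(count, Relation.GREATER_THAN_OR_EQUAL_TO);
                }
            }
        }
        return new TotalHitsEstimate(count, Relation.EQUAL_TO);
    }
}
```

Callers that display "about N results" need to check the relation before presenting the value as exact, which is the practical consequence of totalHits no longer being a plain long.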

New in Apache Lucene 7.7.0 (Feb 12, 2019)

  • Fix LatLonShape WITHIN queries that fail with multiple search polygons that share the dateline.
  • Fix LatLonShape's within and disjoint queries, which could return false positives with indexed multi-shapes.
  • ExitableDirectoryReader may now time out queries that run on points such as range queries or geo queries.
  • StandardTokenizer and UAX29URLEmailTokenizer now support Unicode 9.0, and provide Unicode UTS#51 v11.0 Emoji tokenization with the "<EMOJI>" token type.
  • TopFieldCollector can now early-terminate queries when sorting by SortField.DOC.
  • Speed up merging segments of points with data dimensions by only sorting on the indexed dimensions.
  • The KoreanTokenizer no longer splits unknown words on combining diacritics and detects script boundaries more accurately with Character#UnicodeScript#of.
  • Change LatLonShape encoding to use 4 bytes Per Dimension.
  • BufferedUpdates now uses an optimized storage for buffering docvalues updates that can save up to 80% of the heap used compared to the previous implementation and uses non-object based datastructures.
  • Moving to the default accepted overhead ratio for packed ints in DocValuesFieldUpdates yields an up to 4x performance improvement when applying doc values updates.
  • Doc-value updates get applied faster by sorting with quicksort, rather than an in-place mergesort, which needs to perform fewer swaps.
  • Decrease I/O pressure when merging high dimensional points.

New in Apache Lucene 7.6.0 (Feb 12, 2019)

  • Index sorting corruption due to numeric overflow has been fixed. Indices affected by this bug can be detected by running the CheckIndex command on a 7.6+ release distribution.
  • Better tessellation processing of Polygons including graceful exceptions for detecting invalid shapes.
  • Points codec now supports selective indexing: the ability to designate dimensions as "data only" dimensions that do not affect construction of the index.
  • New Simple WKT Shape Parser builds Lucene geometries (polygons, lines, rectangles) from WKT format.
  • New LatLonShapeLineQuery queries indexed shapes with arbitrary lines.
  • The analyzeGraphPhrase query builder creates one phrase query per finite string in the graph, based on the slop parameter.
  • Performance in PerFieldMergeState#FilterFieldInfos has been improved from O(N) to O(1) lookup time.

New in Apache Lucene 7.5.0 (Sep 25, 2018)

  • Fixed IndexWriter#deleteDocs(Query... query) applying deletes to the wrong documents when the index is sorted.
  • TieredMergePolicy now respects maxSegmentSizeMB by default when executing findForcedMerges and findForcedDeletesMerges.
  • New points-based shape indexing and searching that decomposes shapes into a triangular mesh and indexes each triangle as a 6-dimension point.
  • A new ByteBuffer based Directory implementation that aims to replace the deprecated RAMDirectory.
  • The UnifiedHighlighter can now use the MatchesIterator API to highlight any query more accurately.
  • TopFieldComparator can now stop comparing documents if the index is sorted, even if hits still need to be visited to compute the hit count.
  • TieredMergePolicy can control how aggressively deletes should be reclaimed with the new deletesPctAllowed setting.
  • Please read CHANGES.txt for a full list of new features and changes:
  • https://lucene.apache.org/core/7_5_0/changes/Changes.html

New in Apache Lucene 7.4.0 (Jul 4, 2018)

  • This release contains numerous bug fixes, optimizations, and improvements:
  • https://lucene.apache.org/core/7_4_0/changes/Changes.html

New in Apache Lucene 7.3.1 (May 20, 2018)

  • Bug fixes:
  • LUCENE-8254: LRUQueryCache could cause IndexReader to hang on close, when shared with another reader with no CacheHelper.

New in Apache Lucene 7.3.0 (Apr 5, 2018)

  • Bug Fixes:
  • LUCENE-8077: Fixed bug in how CheckIndex verifies doc-value iterators.
  • SOLR-11758: Fixed FloatDocValues.boolVal to correctly return true for all values != 0.0F
  • LUCENE-8121: The UnifiedHighlighter would highlight some terms within some nested SpanNearQueries at positions where it should not have. It's fixed in the UH by switching to the SpanCollector API. The original Highlighter still has this problem (LUCENE-2287, LUCENE-5455, LUCENE-6796). Some public but internal parts of the UH were refactored.
  • LUCENE-8120: Fix LatLonBoundingBox's toString() method
  • LUCENE-8130: Fix NullPointerException from TermStates.toString()
  • LUCENE-8124: Fixed HyphenationCompoundWordTokenFilter to correctly handle hyphenation patterns with indicator >= 7.
  • LUCENE-8163: BaseDirectoryTestCase could produce random filenames that fail on Windows
  • LUCENE-8174: Fixed {Float,Double,Int,Long}Range.toString().
  • LUCENE-8182: Fixed BoostingQuery to apply the context boost instead of the parent query boost
  • LUCENE-8188: Fixed bugs in OpenNLPOpsFactory that were causing InputStreams fetched from the ResourceLoader to be leaked
  • Further details at:
  • https://lucene.apache.org/core/7_3_0/changes/Changes.html

New in Apache Lucene 7.2.1 (Jan 16, 2018)

  • Fix advanceExact on SortedNumericDocValues produced by Lucene54DocValuesProducer.

New in Apache Lucene 7.1.0 (Oct 17, 2017)

  • Highlights:
  • New Geo3D shapes for non-spherical planet models
  • Serialization and deserialization support for Geo3D
  • A new CoveringQuery, whose required number of matching clauses can be defined per document
  • New BengaliAnalyzer for Bengali language
  • A point based range field called LatLonBoundingBox
  • FloatPointNearestNeighbor, an N-dimensional FloatPoint K-nearest-neighbor search implementation
  • Faster default taxonomy cache
  • Support for computing facet counts for individual numeric values via LongValueFacetCounts
  • Faster geo-distance queries in case of dense single-valued fields when most documents match
  • Better heuristics in IndexOrDocValuesQuery
  • Optimized builds for OrdinalMap (used by SortedSetDocValuesFacetCounts and others)

New in Apache Lucene 7.0.1 (Oct 7, 2017)

  • Bug fixes:
  • ConjunctionScorer.getChildren was failing to return all child scorers.

New in Apache Lucene 7.0.0 (Sep 21, 2017)

  • Highlights:
  • Doc values switched from random access to iterators.
  • The 7.0 codec now sparsely encodes sparse doc values and length normalization factors ("norms"), which also translates to optimizations in both indexing and search on sparse values. With these changes, you finally only pay for what you actually use with doc values, in index size, indexing performance, etc.
  • Index time boost for documents is now removed.
  • Substantial performance gains for delete and update heavy Lucene usage; see http://blog.mikemccandless.com/2017/07/lucene-gets-concurrent-deletes-and.html for details
  • Query scoring is now simpler with the removal of the coord factor and query normalization.
  • Classic query parser no longer splits on whitespace. This enables better multi-word synonym support.
  • The version of Lucene that created the index segment is now recorded, along with the version that last modified the index.
  • IndexWriter, used to add, update and delete documents in your index, will no longer accept broken token offsets sometimes produced by mis-behaving token filters.
  • IndexReader exposes methods that are typically used to manage resources whose lifetime needs to mimic the lifetime of segments/indexes, typically caches. They have been made much less trappy.
  • The dimensional points API now takes a field name up front to offer per-field points access, matching how the doc values APIs work.
  • The PostingsHighlighter was removed. Migrating to the UnifiedHighlighter should be straight-forward.
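The move from random-access to iterator-style doc values, together with sparse encoding, can be pictured with the following plain-JDK sketch. It is a toy model under stated assumptions (a sorted array of doc IDs that carry a value, plus a parallel array of values); the class name and layout are hypothetical and do not reflect Lucene's actual codec or its NumericDocValues class.

```java
import java.util.Arrays;

// Toy model of iterator-style doc values over a sparse field; not Lucene's implementation.
final class SparseLongValues {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs;    // sorted doc IDs that actually have a value
    private final long[] values; // values[i] belongs to docs[i]
    private int index = -1;      // current position, -1 before iteration starts

    SparseLongValues(int[] docs, long[] values) {
        this.docs = docs;
        this.values = values;
    }

    /** Moves to the next doc that has a value, or returns NO_MORE_DOCS. */
    int nextDoc() {
        if (index < docs.length) {
            index++;
        }
        return index < docs.length ? docs[index] : NO_MORE_DOCS;
    }

    /** Positions on target (targets must not go backwards); true if target has a value. */
    boolean advanceExact(int target) {
        int from = Math.max(index, 0);
        int pos = Arrays.binarySearch(docs, from, docs.length, target);
        if (pos >= 0) {
            index = pos;
            return true;
        }
        index = -pos - 2; // last doc before target, so nextDoc() lands after it
        return false;
    }

    /** Value of the current doc; only valid after a successful nextDoc/advanceExact. */
    long longValue() {
        return values[index];
    }

    public static void main(String[] args) {
        // Only three docs carry a value, so storage is proportional to three entries.
        SparseLongValues dv =
            new SparseLongValues(new int[] {4, 17, 900_000}, new long[] {10L, 20L, 30L});
        for (int doc = dv.nextDoc(); doc != NO_MORE_DOCS; doc = dv.nextDoc()) {
            System.out.println(doc + " -> " + dv.longValue());
        }
    }
}
```

The point of the iterator shape is that nothing is stored or visited for documents without a value, which is what "only pay for what you actually use" refers to.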

New in Apache Lucene 6.6.1 (Sep 6, 2017)

  • Bug Fixes:
  • LUCENE-7869: Changed MemoryIndex to sort 1d points. In the case of 1d points, the PointInSetQuery.MergePointVisitor expects these points to be visited in ascending order. The memory index did not do this, which could cause documents with multiple points that should match to not match.
  • LUCENE-7878: Fix query builder to keep the SHOULD clause that wraps multi-word synonyms.

New in Apache Lucene 6.6.0 (Jun 7, 2017)

  • Highlights:
  • A concurrent SortedSet facets implementation
  • spatial-extras HeatmapFacetCounter will now short-circuit its work when Bits.MatchNoBits is passed
  • OfflineSorter now passes the total number of items it will write to getWriter()
  • Move dictionary for Ukrainian analyzer to external dependency
  • SortedSetDocValuesReaderState now implements Accountable so one can see how much RAM it is using
  • OfflineSorter can now run concurrently if you pass it an optional ExecutorService
  • Sorted set facets now use sparse storage when collecting hits, when appropriate
  • PostingsHighlighter has been deprecated in favour of the UnifiedHighlighter

New in Apache Lucene 6.5.1 (Apr 28, 2017)

  • Bug fixes:
  • Fixed join queries to not reference IndexReaders, as it could cause leaks if they are cached.
  • Made LRUQueryCache delegate the scoreSupplier method.
  • Fixed index sorting to work with sparse numeric and binary docvalues field

New in Apache Lucene 6.5.0 (Mar 27, 2017)

  • It is now possible to filter out duplicates in the NRT suggester
  • SimpleQueryString now supports default fuzziness
  • IndexWriter can return the list of visible field names
  • DisjunctionScorer now supports returning the matching children clauses
  • A new FunctionScoreQuery that modifies the internal query's score using the per-document values
  • A new FunctionMatchQuery that returns any documents with a value that matches a predicate
  • A new WordDelimiterGraphFilter that outputs a correct graph structure for multi-token expansion at query time
  • A new PatternTokenizer that uses Lucene's RegExp implementation
  • RangeFieldQuery now supports CROSSES relation
  • A new IndexOrDocValuesQuery that uses either an index (points or terms) or doc values in order to run a (range, geo box and distance) query, depending which one is more efficient
  • index-time boosts are deprecated
  • Term filters are no longer cached
  • Compound filters are cached earlier than regular queries
  • BKDReader now calls grow on larger increments
  • LatLonPointInPolygonQuery is faster
  • LatLonPointDistanceQuery now skips distance computations more often
  • To-parent block joins now implement two-phase iteration
  • Point ranges that match most documents are faster
  • PointValues#estimatePointCount is faster with Relation.CELL_INSIDE_QUERY
  • Segments are now also sorted during flush, and merging on a sorted index is substantially faster by using some of the same bulk merge optimizations that non-sorted merging uses

New in Apache Lucene 6.4.2 (Mar 9, 2017)

  • Bug Fixes:
  • LUCENE-7698: CommonGramsQueryFilter was producing a disconnected token graph, messing up phrase queries when it was used during query parsing
  • LUCENE-7676: Fixed FilterCodecReader to override more super-class methods. Also added TestFilterCodecReader class.
  • LUCENE-7717: The UnifiedHighlighter and PostingsHighlighter were not highlighting prefix queries with multi-byte characters. TermRangeQuery is affected too.

New in Apache Lucene 6.4.1 (Feb 7, 2017)

  • Highlights:
  • Javadocs now build successfully with Java 8u121
  • Fixed memory leak in the case that TermQuery or SpanTermQuery objects that wrap a TermContext were cached
  • Fixed native memory leak when the codec is configured with the BEST_COMPRESSION option
  • AnalyzingInfixSuggester now only opens an IndexWriter when changes need to be applied

New in Apache Lucene 6.4.0 (Jan 24, 2017)

  • Highlights:
  • Lucene's best efforts to un-map memory mapped files with "MMapDirectory" now work with the latest Java9 early access builds
  • A new similarity "BooleanSimilarity" that gives terms a score that is equal to their query boost
  • The axiomatic family of similarities (6 in total) based on https://www.eecis.udel.edu/~hfang/pubs/sigir05-axiom.pdf
  • A new token filter "SynonymGraphFilter" that outputs a correct graph structure for multi-token synonyms at query time
  • Graph token streams, such as those produced by the "SynonymGraphFilter", are now handled accurately by query parsers
  • A new collector "DocValuesStatsCollector" gives the ability to compute statistics on DocValues field
  • It is now possible to filter "SortedDocValues" and "SortedSetDocValues" terms enum with a compiled automaton
  • The "UnifiedHighlighter" can now highlight fields with queries that don't necessarily refer to that field
  • DrillSideways can now run queries concurrently
  • Index sorting now supports sorting on multi-valued fields using MIN, MAX, etc. selectors
  • Points do not store the implicit split dimension in the 1-dimension case. This saves between 6% of memory for the largest types, such as InetAddressPoint, and 33% for the smaller types, such as HalfFloatPoint.
  • The BKD in-memory index for dimensional points now uses a compressed format, using substantially less RAM in some cases
  • The BKD writing now buffers each leaf block in heap before writing to disk, giving a small speedup in points-heavy use cases
  • "TermAutomatonQuery" now rewrites to more efficient queries when possible

New in Apache Lucene 6.3.0 (Nov 8, 2016)

  • Highlights:
  • A brand new "UnifiedHighlighter" derivative of the PostingsHighlighter that can consume offsets from postings, term vectors, or analysis. It can highlight phrases as accurately as the standard Highlighter. Light term vectors can be used with offsets in postings for fast wildcard (MultiTermQuery) highlighting.
  • SimpleQueryParser now parses '*' to MatchAllDocsQuery.
  • FuzzyQuery now matches all terms within the specified edit distance, even if they are short terms.
  • Points do not store the implicit split dimension in the 1-dimension case. This saves between 6% of memory for the largest types, such as InetAddressPoint, and 33% for the smaller types, such as HalfFloatPoint.
  • Many other changes and bug fixes.
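To make "within the specified edit distance" in the FuzzyQuery entry above concrete, here is a minimal plain-JDK sketch of the classic Levenshtein distance. It is an illustration only: the class name is hypothetical, and FuzzyQuery itself is built on Levenshtein automata and by default also counts a transposition of adjacent characters as a single edit, which this sketch does not model.

```java
// Plain Levenshtein distance via dynamic programming; an illustration, not Lucene code.
final class EditDistanceSketch {

    /** Minimum number of single-character insertions, deletions, or substitutions. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Short terms such as "ab" and "ac" are one edit apart.
        System.out.println(levenshtein("ab", "ac"));          // 1
        System.out.println(levenshtein("kitten", "sitting")); // 3
    }
}
```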

New in Apache Lucene 6.2.1 (Sep 20, 2016)

  • Highlights:
  • LUCENE-7417: The standard Highlighter could throw an IllegalArgumentException when trying to highlight a query containing a degenerate case of a MultiPhraseQuery with one term.
  • LUCENE-7440: Document id skipping (PostingsEnum.advance) could throw an ArrayIndexOutOfBoundsException on large index segments (>1.8B docs) with large skips.
  • LUCENE-7318: Fix backwards compatibility issues around StandardAnalyzer and its components, introduced with Lucene 6.2.0. The moved classes were restored in their original packages: LowercaseFilter and StopFilter, as well as several utility classes.

New in Apache Lucene 6.2.0 (Aug 26, 2016)

  • The CREATE_NEW flag is passed when creating a file to ensure Lucene is really write-once
  • Index numeric ranges (min and max value in a single field) and search by overlapping range
  • IndexWriter methods return a sequence number indicating effective order of operations across threads
  • UkrainianMorfologikAnalyzer is a new dictionary based analyzer for the Ukrainian language
  • The Polygon class can now be created from a GeoJSON string
  • Compound file creation now verifies checksum of its component files
  • Index time sorting is now a core feature, and supports dimensional points
  • StandardAnalyzer is moved to core and is the default analyzer
  • MatchNoDocsQuery now includes the reason it was created
  • QueryParser can now be told to not pre-split on whitespace
  • MMapDirectory tries harder to prevent SIGSEGV if buggy code tries to execute searches after the index was closed, but it's still best effort
  • MMapDirectory no longer allocates weak references to ease garbage collection
  • Conjunction (MUST, FILTER) queries are faster
  • Dimensional points have much faster (~40%) flush time and use less space in the index
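The "search by overlapping range" entry above comes down to an interval-intersection test. The sketch below shows that test for the 1-dimensional case with closed ranges; the class and method names are hypothetical and this is not Lucene's range field API.

```java
// 1-D toy model of a range-overlap check; hypothetical names, not Lucene API.
final class RangeOverlapSketch {

    /** True if indexed range [min, max] shares at least one value with query [qMin, qMax]. */
    static boolean intersects(long min, long max, long qMin, long qMax) {
        return min <= qMax && qMin <= max;
    }

    public static void main(String[] args) {
        System.out.println(intersects(10, 20, 15, 30)); // true: the ranges share 15..20
        System.out.println(intersects(10, 20, 21, 25)); // false: the query starts after max
    }
}
```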

New in Apache Lucene 6.1.0 (Jun 27, 2016)

  • New features:
  • Numerous improvements to LatLonPoint, for indexing a latitude/longitude point and searching by polygon, distance or box, or finding nearest neighbors
  • Geo3D now has simple APIs for creating common shape queries, matching LatLonPoint
  • Optimizations:
  • Faster indexing and searching of points.
  • Faster geo-spatial indexing and searching for LatLonPoint, Geo3D and GeoPoint (see http://home.apache.org/~mikemccand/geobench.html )
  • HardlinkCopyDirectoryWrapper optimizes file copies using hard links
  • In case of contention, the query cache now prefers returning an uncached Scorer rather than waiting on a lock.
  • Bug fixes:
  • BooleanQuery could sometimes assign too low scores to ranges of documents that matched a single clause.
  • Doc values updates could sometimes be applied in the wrong order.

New in Apache Lucene 5.3.1 (Oct 2, 2015)

  • Bug Fixes:
  • Remove classloader hack in MorfologikFilter
  • UsageTrackingQueryCachingPolicy no longer caches trivial queries like MatchAllDocsQuery
  • Fixed BoostingQuery to rewrite wrapped queries

New in Apache Lucene 5.3.0 (Aug 25, 2015)

  • New features:
  • LUCENE-6485: Add CustomSeparatorBreakIterator to postings highlighter which splits on any character. For example, it can be used with getMultiValueSeparator to render whole field values.
  • LUCENE-6459: Add common suggest API that mirrors Lucene's Query/IndexSearcher APIs for Document based suggester. Adds PrefixCompletionQuery, RegexCompletionQuery, FuzzyCompletionQuery and ContextQuery.
  • LUCENE-6487: Spatial Geo3D API now has a WGS84 ellipsoid world model option.
  • LUCENE-6477: Add experimental BKD geospatial tree doc values format and queries, for fast "bbox/polygon contains lat/lon points"
  • LUCENE-6526: Asserting(Query|Weight|Scorer) now ensure scores are not computed if they are not needed.
  • LUCENE-6481: Add GeoPointField, GeoPointInBBoxQuery, GeoPointInPolygonQuery for simple "indexed lat/lon point in bbox/shape" searching.
  • LUCENE-5954: The segments_N commit point now stores the Lucene version that wrote the commit as well as the Lucene version that wrote the oldest segment in the index, for faster checking of "too old" indices
  • LUCENE-6519: BKDPointInPolygonQuery is much faster by avoiding the per-hit polygon check when a leaf cell is fully contained by the polygon.
  • LUCENE-6549: Add preload option to MMapDirectory.
  • LUCENE-6504: Add Lucene53Codec, with norms implemented directly via the Directory's RandomAccessInput api.
  • LUCENE-6539: Add new DocValuesNumbersQuery, to match any document containing one of the specified long values. This change also moves the existing DocValuesTermsQuery and DocValuesRangeQuery to Lucene's sandbox module, since in general these queries are quite slow and are only fast in specific cases.
  • LUCENE-6577: Give earlier and better error message for invalid CRC.
  • LUCENE-6544: Geo3D: (1) Regularize path & polygon construction, (2) add PlanetModel.surfaceDistance() (ellipsoidal calculation), (3) cache lat & lon in GeoPoint, (4) add thread-safety where missing -- Geo3dShape.
  • LUCENE-6606: SegmentInfo.toString now confesses how the documents were sorted, when SortingMergePolicy was used
  • LUCENE-6524: IndexWriter can now be initialized from an already open near-real-time or non-NRT reader.
  • LUCENE-6578: Geo3D can now compute the distance from a point to a shape, both inner distance and to an outside edge. Multiple distance algorithms are available.
  • LUCENE-6632: Geo3D: Compute circle planes more accurately.
  • LUCENE-6653: Added general purpose BytesTermAttribute to basic token attributes package that can be used for TokenStreams that solely produce binary terms.
  • LUCENE-6365: Add Operations.topoSort, to run topological sort of the states in an Automaton
  • LUCENE-6365: Replace Operations.getFiniteStrings with a more scalable iterator API (FiniteStringsIterator)
  • LUCENE-6589: Add a new org.apache.lucene.search.join.CheckJoinIndex class that can be used to validate that an index has an appropriate structure to run join queries.
  • LUCENE-6659: Remove IndexWriter's unnecessary hard limit on max concurrency
  • LUCENE-6547: Add GeoPointDistanceQuery, matching all points within the specified distance from the center point. Fix GeoPointInBBoxQuery to handle dateline crossing.
  • LUCENE-6694: Add LithuanianAnalyzer and LithuanianStemmer.
  • LUCENE-6695: Added a new BlendedTermQuery to blend statistics across several terms.
  • LUCENE-6706: Added a new PayloadScoreQuery that generalises the behaviour of PayloadTermQuery and PayloadNearQuery to all Span queries.
  • LUCENE-6697: Add experimental range tree doc values format and queries, based on a 1D version of the spatial BKD tree, for a faster and smaller alternative to postings-based numeric and binary term filtering. Range trees can also handle values larger than 64 bits.
  • LUCENE-6647: Add GeoHash string utility APIs
  • LUCENE-6710: GeoPointField now uses full 64 bits (up from 62) to encode lat/lon
  • LUCENE-6580: SpanNearQuery now allows defined-width gaps in its subqueries
  • LUCENE-6712: Use doc values to post-filter GeoPointField hits that fall in boundary cells, resulting in smaller index, faster searches and less heap used for each query
  • API Changes:
  • LUCENE-6508: Simplify Lock api, there is now just Directory.obtainLock() which returns a Lock that can be released (or fails with exception). Add lock verification to IndexWriter. Improve exception messages when locking fails.
  • LUCENE-6529: Removed an optimization in UninvertingReader that was causing incorrect results for Numeric fields using precisionStep
  • LUCENE-6551: Add missing ConcurrentMergeScheduler.getAutoIOThrottle getter
  • LUCENE-6552: Add MergePolicy.OneMerge.getMergeInfo and rename setInfo to setMergeInfo
  • LUCENE-6525: Deprecate IndexWriterConfig's writeLockTimeout.
  • LUCENE-6466: Moved SpanQuery.getSpans() and .extractTerms() to SpanWeight
  • LUCENE-6371, LUCENE-6490: Payload collection from Spans is moved to a more generic SpanCollector framework. Spans no longer implements .hasPayload() and .getPayload() methods, and instead exposes a collect() method that allows the collection of arbitrary postings information. SpanPayloadCheckQuery and SpanPayloadNearCheckQuery have moved from the .spans package to the .payloads package.
  • LUCENE-6583: FilteredQuery is deprecated and will be removed in 6.0. It should be replaced with a BooleanQuery which handles the query as a MUST clause and the filter as a FILTER clause.
  • LUCENE-6553: The postings, spans and scorer APIs no longer take an acceptDocs parameter. Live docs are now always checked on top of these APIs.
  • LUCENE-6634: PKIndexSplitter now takes a Query instead of a Filter to decide how to split an index.
  • LUCENE-6643: GroupingSearch from lucene/grouping was changed to take a Query object to define groups instead of a Filter.
  • LUCENE-6554: ToParentBlockJoinFieldComparator was removed because of a bug with missing values that could not be fixed. ToParentBlockJoinSortField now works with string or numeric doc values selectors. Sorting on anything else than a string or numeric field would require to implement a custom selector.
  • LUCENE-6648: All lucene/facet APIs now take Query objects where they used to take Filter objects.
  • LUCENE-6640: Suggesters now take a BitsProducer object instead of a Filter object to reduce the scope of doc IDs that may be returned, emphasizing the fact that these objects need to support random-access.
  • LUCENE-6646: Make EarlyTerminatingCollector take a Sort object directly instead of a SortingMergePolicy.
  • LUCENE-6649: BitDocIdSetFilter and BitDocIdSetCachingWrapperFilter are now deprecated in favour of BitSetProducer and QueryBitSetProducer, which do not extend oal.search.Filter.
  • LUCENE-6607: Factor out geo3d into its own spatial3d module.
  • LUCENE-6531: PhraseQuery is now immutable and can be built using the PhraseQuery.Builder class.
  • LUCENE-6570: BooleanQuery is now immutable and can be built using the BooleanQuery.Builder class.
  • LUCENE-6702: NRTSuggester: Add a method to inject context values at index time in ContextSuggestField. Simplify ContextQuery logic for extracting contexts and add dedicated method to consider all context values at query time.
  • LUCENE-6719: NumericUtils getMinInt, getMaxInt, getMinLong, getMaxLong now return null if there are no terms for the specified field. Previously these methods returned primitive values and raised an undocumented NullPointerException if there were no terms for the field.
  • Bug fixes:
  • LUCENE-6500: ParallelCompositeReader did not always call closed listeners. This was fixed by LUCENE-6501.
  • LUCENE-6520: Geo3D GeoPath.done() would throw an NPE if adjacent path segments were co-linear.
  • LUCENE-5805: QueryNodeImpl.removeFromParent was doing nothing in a costly manner
  • LUCENE-6533: SlowCompositeReaderWrapper no longer caches its live docs instance since this can prevent future improvements like a disk-backed live docs
  • LUCENE-6558: Highlighters now work with CustomScoreQuery (Cao Manh Dat via Mike McCandless)
  • LUCENE-6560: BKDPointInBBoxQuery now handles "dateline crossing" correctly
  • LUCENE-6564: Change PrintStreamInfoStream to use thread safe Java 8 ISO-8601 date formatting (in Lucene 5.x use Java 7 FileTime#toString as workaround); fix output of tests to use same format.
  • LUCENE-6593: Fixed ToChildBlockJoinQuery's scorer to not refuse to advance to a document that belongs to the parent space.
  • LUCENE-6591: Never write a negative vLong
  • LUCENE-6588: Fix how ToChildBlockJoinQuery deals with acceptDocs.
  • LUCENE-6597: Geo3D's GeoCircle now supports a world-globe diameter.
  • LUCENE-6608: Fix potential resource leak in BigramDictionary.
  • LUCENE-6614: Improve partition detection in IOUtils#spins() so it works with NVMe drives.
  • LUCENE-6586: Fix typo in GermanStemmer, causing possible wrong value for substCount.
  • LUCENE-6658: Fix IndexUpgrader to also upgrade indexes without any segments.
  • LUCENE-6677: QueryParserBase fails to enforce maxDeterminizedStates when creating a WildcardQuery
  • LUCENE-6680: Preserve two suggestions that have same key and weight but different payloads
  • LUCENE-6681: SortingMergePolicy must override MergePolicy.size(...).
  • LUCENE-6682: StandardTokenizer performance bug: scanner buffer is unnecessarily copied when maxTokenLength doesn't change. Also stop silently maxing out buffer size (and effectively also max token length) at 1M chars, but instead throw an exception from setMaxTokenLength() when the given length is greater than 1M chars.
  • LUCENE-6696: Fix FilterDirectoryReader.close() to never close the underlying reader several times.
  • LUCENE-6334: FastVectorHighlighter failed to highlight phrases across more than one value in a multi-valued field.
  • LUCENE-6704: GeoPointDistanceQuery was visiting too many term ranges, consuming too much heap for a large radius
  • SOLR-5882: fix ScoreMode.Min at ToParentBlockJoinQuery
  • LUCENE-6718: JoinUtil.createJoinQuery failed to rewrite queries before creating a Weight.
  • LUCENE-6713: TooComplexToDeterminizeException claims to be serializable but wasn't
  • LUCENE-6723: Fix date parsing problems in Java 9 with date formats using English weekday/month names.
  • LUCENE-6618: Properly set MMapDirectory.UNMAP_SUPPORTED when it is now allowed by security policy.
  • Changes in runtime behavior:
  • LUCENE-6501: The subreader structure in ParallelCompositeReader was flattened, because the current implementation had too many hidden bugs regarding refcounting and close listeners. If you create a new ParallelCompositeReader, it will just take all leaves of the passed readers and form a flat structure of ParallelLeafReaders instead of trying to assemble the original structure of composite and leaf readers.
  • LUCENE-6538: Also include java.vm.version and java.runtime.version in per-segment diagnostics
  • LUCENE-6537: NearSpansOrdered no longer tries to minimize its Span matches. This means that the matching algorithm is entirely lazy. All spans returned by the previous implementation are still reported, but matching documents may now also return additional spans that were previously discarded in preference to shorter overlapping ones.
  • LUCENE-6569: Optimize MultiFunction.anyExists and allExists to eliminate excessive array creation in common 2 argument usage
  • LUCENE-2880: Span queries now score more consistently with regular queries.
  • LUCENE-6601: FilteredQuery now always rewrites to a BooleanQuery which handles the query as a MUST clause and the filter as a FILTER clause. LEAP_FROG_QUERY_FIRST_STRATEGY and LEAP_FROG_FILTER_FIRST_STRATEGY do not guarantee anymore which iterator will be advanced first, it will depend on the respective costs of the iterators. QUERY_FIRST_FILTER_STRATEGY and RANDOM_ACCESS_FILTER_STRATEGY still consume the filter using its random-access API, however the returned bits may be called on different documents compared to before.
  • LUCENE-6542: FSDirectory's ctor now works with security policies or file systems that restrict write access.
  • LUCENE-6651: The default implementation of AttributeImpl#reflectWith(AttributeReflector) now uses AccessController#doPrivileged() to do the reflection. Please consider implementing this method in all your custom attributes, because the method will be made abstract in Lucene 6.
  • LUCENE-6639: LRUQueryCache and CachingWrapperQuery now consider a query as "used" when the first Scorer is pulled instead of when a Scorer is pulled on the first segment on an index.
  • LUCENE-6579: IndexWriter now sacrifices (closes) itself to protect the index when an unexpected, tragic exception strikes while merging.
  • LUCENE-6691: SortingMergePolicy.isSorted now considers FilterLeafReader instances. EarlyTerminatingSortingCollector.terminatedEarly accessor added. TestEarlyTerminatingSortingCollector.testTerminatedEarly test added.
  • LUCENE-6609: Add getSortField impls to many subclasses of FieldCacheSource which return the most direct SortField implementation. In many trivial sort by ValueSource usages, this will result in less RAM, and more precise sorting of extreme values due to no longer converting to double.
  • Optimizations:
  • LUCENE-6548: Some optimizations for BlockTree's intersect with very finite automata
  • LUCENE-6585: Flatten conjunctions and conjunction approximations into parent conjunctions. For example a sloppy phrase query of "foo bar"~5 with a filter of "baz" will internally leapfrog foo,bar,baz as one conjunction.
  • LUCENE-6325: Reduce RAM usage of FieldInfos, and speed up lookup by number, by using an array instead of TreeMap except in very sparse cases
  • LUCENE-6617: Reduce heap usage for small FSTs
  • LUCENE-6616: IndexWriter now lists the files in the index directory only once on init, and IndexFileDeleter no longer suppresses FileNotFoundException and NoSuchFileException. This also improves IndexFileDeleter to delete segments_N files last, so that in the presence of a virus checker, the index is never left in a state where an expired segments_N references non-existing files
  • LUCENE-6645: Optimized the way we merge postings lists in multi-term queries and TermsQuery. This should especially help when there are lots of small postings lists.
  • LUCENE-6668: Optimized storage for sorted set and sorted numeric doc values in the case that there are few unique sets of values.
  • LUCENE-6690: Sped up MultiTermsEnum.next() on high-cardinality fields.
  • LUCENE-6621: Removed two unused variables in analysis/stempel/src/java/org/egothor/stemmer/Compile.java
  • Build:
  • LUCENE-6518: Don't report false thread leaks from IBM J9 ClassCache Reaper in test framework.
  • LUCENE-6567: Simplify payload checking in SpanPayloadCheckQuery
  • LUCENE-6568: Make rat invocation depend on ivy configuration being set up
  • LUCENE-6683: ivy-fail goal directs people to non-existent page
  • LUCENE-6693: Updated Groovy to 2.4.4, Pegdown to 1.5, Svnkit to 1.8.10. Also fixed some PermGen errors while running full build caused by these updates: Tasks are now installed from root's build.xml.
  • LUCENE-6741: Fix jflex files to regenerate the java files correctly.
  • Test Framework:
  • LUCENE-6637: Fix FSTTester to not violate file permissions on -Dtests.verbose=true.
  • LUCENE-6542: LuceneTestCase now has runWithRestrictedPermissions() to run an action with reduced permissions. This can be used to simulate special environments (e.g., read-only dirs). If tests are running without a security manager, an assume cancels test execution automatically.
  • LUCENE-6652: Removed lots of useless Byte(s)TermAttributes all over test infrastructure.
  • LUCENE-6563: Improve MockFileSystemTestCase.testURI to check if a path can be encoded according to local filesystem requirements. Otherwise stop test execution.
  • Changes in backwards compatibility policy:
  • LUCENE-6553: The iterator returned by the LeafReader.postings method now always includes deleted docs, so you have to check for deleted documents on top of the iterator.
  • LUCENE-6633: DuplicateFilter has been deprecated and will be removed in 6.0. DiversifiedTopDocsCollector can be used instead with a maximum number of hits per key equal to 1.
  • LUCENE-6653: The workflow for consuming the TermToBytesRefAttribute was changed: getBytesRef() now does all work and is called on each token, fillBytesRef() was removed. The implementation is free to reuse the internal BytesRef or return a new one on each call.
  • LUCENE-6682: StandardTokenizer.setMaxTokenLength() now throws an exception if a length greater than 1M chars is given. Previously the effective max token length (the scanner's buffer) was capped at 1M chars, but getMaxTokenLength() incorrectly returned the previously requested length, even when it exceeded 1M.
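
Example for LUCENE-6553: because the postings iterator now includes deleted documents, callers have to apply the live-docs check themselves. The following is only a minimal sketch of that pattern in plain Java, not Lucene code: the class LiveDocsFilterSketch and the method liveMatches are hypothetical, an int[] stands in for the postings iterator, and a java.util.BitSet stands in for the live-docs bits (null meaning that the segment has no deletions).

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical stand-in for "check for deleted documents on top of the iterator".
class LiveDocsFilterSketch {

    /** Keeps only doc IDs that are still live; a null liveDocs means nothing was deleted. */
    static List<Integer> liveMatches(int[] postings, BitSet liveDocs) {
        List<Integer> out = new ArrayList<>();
        for (int doc : postings) {
            if (liveDocs == null || liveDocs.get(doc)) {
                out.add(doc);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet live = new BitSet();
        live.set(0, 6);  // docs 0..5 start out live
        live.clear(3);   // doc 3 has been deleted
        System.out.println(liveMatches(new int[] {1, 3, 5}, live)); // prints [1, 5]
    }
}
```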

New in Apache Lucene 5.2.1 (Jun 15, 2015)

  • Bug LUCENE-6482: Class loading deadlock relating to Codec initialization, default codec and SPI discovery
  • Bug LUCENE-6559: TimeLimitingCollector should check timeout also when LeafCollector is pulled
  • Bug LUCENE-6527: TermWeight should not load norms when needsScores is false

New in Apache Lucene 5.2.0 (Jun 9, 2015)

  • Span queries now share document conjunction/intersection code with boolean queries, and use two-phased iterators for faster intersection by avoiding loading positions in certain cases.
  • Added two-phase support to SpanNotQuery, and SpanPositionCheckQuery and its subclasses: SpanPositionRangeQuery, SpanPayloadCheckQuery, SpanNearPayloadCheckQuery, SpanFirstQuery.
  • Added a new query time join to the join module that uses global ordinals, which is faster for subsequent joins between reopens.
  • New CompositeSpatialStrategy combines speed of RPT with accuracy of SDV. Includes optimized Intersect predicate to avoid many geometry checks. Uses TwoPhaseIterator.
  • New LimitTokenOffsetFilter that limits tokens to those before a configured maximum start offset.
  • New spatial PackedQuadPrefixTree, a generally more efficient choice than QuadPrefixTree, especially for high precision shapes. When used, you should typically disable RPT's pruneLeafyBranches option.
  • Expressions now support bindings keys that look like zero arg functions
  • Add SpanWithinQuery and SpanContainingQuery that return spans inside of / containing other spans.
  • New Spatial "Geo3d" API with partial Spatial4j integration. It is a set of shapes implemented using 3D planar geometry for calculating spatial relations on the surface of a sphere. Shapes include Point, BBox, Circle, Path (buffered line string), and Polygon.
  • Various bugfixes and optimizations since the 5.1.0 release.
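
Example of the two-phase idea mentioned above: a cheap approximation first narrows the candidates, and the costly check (for span queries, loading positions) runs only on documents that survive it. This is a sketch in plain Java under stated assumptions, not Lucene's implementation: the class TwoPhaseSketch and its methods are hypothetical, sorted int[] arrays stand in for doc ID iterators, and an IntPredicate stands in for the expensive position check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical model of approximation + verification; not the TwoPhaseIterator API.
class TwoPhaseSketch {

    /** Phase 1: cheap approximation, the intersection of two sorted doc ID arrays. */
    static List<Integer> approximate(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {
                out.add(a[i]);
                i++;
                j++;
            }
        }
        return out;
    }

    /** Phase 2: the costly check runs only on candidates that passed phase 1. */
    static List<Integer> search(int[] a, int[] b, IntPredicate costlyMatch) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : approximate(a, b)) {
            if (costlyMatch.test(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] termA = {1, 4, 7, 9};
        int[] termB = {4, 5, 9};
        // The predicate is an arbitrary placeholder for a position check.
        System.out.println(search(termA, termB, doc -> doc % 2 == 1)); // prints [9]
    }
}
```

In this toy run the costly predicate is evaluated for two candidates (4 and 9) rather than for every document containing either term.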

New in Apache Lucene 5.1.0 (Apr 15, 2015)

  • New Features:
  • LUCENE-6066: Added DiversifiedTopDocsCollector to misc for collecting no more than a given number of results under a choice of key. Introduces new remove method to core's PriorityQueue.
  • LUCENE-6191: New spatial 2D heatmap faceting for PrefixTreeStrategy.
  • LUCENE-6227: Added BooleanClause.Occur.FILTER to filter documents without participating in scoring (unlike MUST).
  • LUCENE-6294: Added oal.search.CollectorManager to allow for parallelization of the document collection process on IndexSearcher.
  • LUCENE-6303: Added filter caching baked into IndexSearcher, disabled by default.
  • LUCENE-6304: Added a new MatchNoDocsQuery that matches no documents.
  • LUCENE-6341: Add a -fast option to CheckIndex.
  • LUCENE-6355: IndexWriter's infoStream now also logs time to write FieldInfos during merge
  • LUCENE-6339: Added Near-real time Document Suggester via custom postings format
  • Bug Fixes:
  • LUCENE-6368: FST.save can truncate output (BufferedOutputStream may be closed after the underlying stream).
  • LUCENE-6249: StandardQueryParser doesn't support pure negative clauses.
  • LUCENE-6190: Spatial pointsOnly flag on PrefixTreeStrategy shouldn't switch all predicates to Intersects.
  • LUCENE-6242: Ram usage estimation was incorrect for SparseFixedBitSet when object alignment was different from 8.
  • LUCENE-6293: Fixed TimSorter bug.
  • LUCENE-6001: DrillSideways hits NullPointerException for certain BooleanQuery searches.
  • LUCENE-6311: Fix NIOFSDirectory and SimpleFSDirectory so that the toString method of IndexInputs confesses when they are from a compound file.
  • LUCENE-6381: Add defensive wait time limit in DocumentsWriterStallControl to prevent hangs during indexing if we miss a .notify/All somewhere
  • LUCENE-6386: Correct IndexWriter.forceMerge documentation to state that up to 3X (X = current index size) spare disk space may be needed to complete forceMerge(1).
  • LUCENE-6395: Seeking by term ordinal was failing to set the term's bytes in MemoryIndex
  • Optimizations:
  • LUCENE-6183, LUCENE-5647: Avoid recompressing stored fields and term vectors when merging segments without deletions. Lucene50Codec's BEST_COMPRESSION mode uses a higher deflate level for more compact storage.
  • LUCENE-6184: Make BooleanScorer only score windows that contain matches.
  • LUCENE-6161: Speed up resolving of deleted terms to docIDs by doing a combined merge sort between deleted terms and segment terms instead of a separate merge sort for each segment. In delete-heavy use cases this can be a sizable speedup.
  • LUCENE-6201: BooleanScorer can now deal with values of minShouldMatch that are greater than one and is used when queries produce dense result sets.
  • LUCENE-6218: Don't decode frequencies or match all positions when scoring is not needed.
  • LUCENE-6233: Speed up CheckIndex when the index has term vectors
  • LUCENE-6198: Added the TwoPhaseIterator API, exposed on scorers, which is for now only used on phrase queries and conjunctions in order to check positions lazily if the phrase query is in a conjunction with other queries.
  • LUCENE-6244, LUCENE-6251: All boolean queries but those that have a minShouldMatch > 1 now either propagate or take advantage of the two-phase iteration capabilities added in LUCENE-6198.
  • LUCENE-6241: FSDirectory.listAll() doesn't filter out subdirectories anymore, for faster performance. Subdirectories don't matter to Lucene. If you need to filter out non-index files with some custom usage, you may want to look at the IndexFileNames class.
  • LUCENE-6262: ConstantScoreQuery does not wrap the inner weight anymore when scores are not required.
  • LUCENE-6263: MultiCollector automatically caches scores when several collectors need them.
  • LUCENE-6275: SloppyPhraseScorer now uses the same logic as ConjunctionScorer in order to advance doc IDs, which takes advantage of the cost() API.
  • LUCENE-6290: QueryWrapperFilter propagates approximations and FilteredQuery rewrites to a BooleanQuery when the filter is a QueryWrapperFilter in order to leverage approximations.
  • LUCENE-6318: Reduce RAM usage of FieldInfos when there are many fields.
  • LUCENE-6320: Speed up CheckIndex.
  • LUCENE-4942: Optimized the encoding of PrefixTreeStrategy indexes for non-point data: 33% smaller index, 68% faster indexing, and 44% faster searching. YMMV
  • API Changes:
  • LUCENE-6204, LUCENE-6208: Simplify CompoundFormat: remove files() and remove files parameter to write().
  • LUCENE-6217: Add IndexWriter.isOpen and getTragicException.
  • LUCENE-6218, LUCENE-6220: Add Collector.needsScores() and needsScores parameter to Query.createWeight().
  • LUCENE-4524, LUCENE-6246, LUCENE-6256, LUCENE-6271: Merge DocsEnum and DocsAndPositionsEnum into a single PostingsEnum iterator. TermsEnum.docs() and TermsEnum.docsAndPositions() are replaced by TermsEnum.postings().
  • LUCENE-6222: Removed TermFilter, use a QueryWrapperFilter(TermQuery) instead. This will be as efficient now that queries can opt out from scoring.
  • LUCENE-6269: Removed BooleanFilter, use a QueryWrapperFilter(BooleanQuery) instead.
  • LUCENE-6270: Replaced TermsFilter with TermsQuery, use a QueryWrapperFilter(TermsQuery) instead.
  • LUCENE-6223: Move BooleanQuery.BooleanWeight to BooleanWeight.
  • LUCENE-1518: Make Filter extend Query and return 0 as score.
  • LUCENE-6245: Force Filter subclasses to implement toString API from Query.
  • LUCENE-6268: Replace FieldValueFilter and DocValuesRangeFilter with equivalent queries that support approximations.
  • LUCENE-6289: Replace DocValuesRangeFilter with DocValuesRangeQuery which supports approximations.
  • LUCENE-6266: Remove unnecessary Directory params from SegmentInfo.toString, SegmentInfos.files/toString, and SegmentCommitInfo.toString.
  • LUCENE-6272: Scorer extends DocIdSetIterator rather than DocsEnum
  • LUCENE-6281: Removed support for slow collations from lucene/sandbox. Better performance would be achieved through CollationKeyAnalyzer or ICUCollationKeyAnalyzer.
  • LUCENE-6286: Removed IndexSearcher methods that take a Filter object. A BooleanQuery with a filter clause must be used instead.
  • LUCENE-6300: PrefixFilter, TermRangeFilter and NumericRangeFilter have been removed. Use PrefixQuery, TermRangeQuery and NumericRangeQuery instead.
  • LUCENE-6303: Replaced FilterCache with QueryCache and CachingWrapperFilter with CachingWrapperQuery.
  • LUCENE-6317: Deprecate DataOutput.writeStringSet and writeStringStringMap. Use writeSetOfStrings and writeMapOfStrings instead.
  • LUCENE-6307: Rename SegmentInfo.getDocCount -> .maxDoc, SegmentInfos.totalDocCount -> .totalMaxDoc, MergeInfo.totalDocCount -> .totalMaxDoc and MergePolicy.OneMerge.totalDocCount -> .totalMaxDoc
  • LUCENE-6367: PrefixQuery now subclasses AutomatonQuery, removing the specialized PrefixTermsEnum.
  • Other:
  • LUCENE-6248: Remove unused odd constants from StandardSyntaxParser.jj
  • LUCENE-6193: Collapse identical catch branches in try-catch statements.
  • LUCENE-6239: Removed RAMUsageEstimator's sun.misc.Unsafe calls.
  • LUCENE-6292: Seed StringHelper better.
  • LUCENE-6333: Refactored queries to delegate their equals and hashcode impls to the super class.
  • LUCENE-6343: DefaultSimilarity javadocs had the wrong float value to demonstrate precision of encoded norms
  • Changes in Runtime Behavior:
  • LUCENE-6255: PhraseQuery now ignores leading holes and requires that positions are positive and added in order.
  • LUCENE-6298: SimpleQueryParser returns an empty query rather than null, if e.g. the terms were all stopwords.
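
Example for LUCENE-6227: a FILTER clause restricts which documents match but contributes nothing to the score, whereas a MUST clause does both. The sketch below is a toy model in plain Java, not Lucene code: the class FilterClauseSketch and the method conjunction are hypothetical, each clause is modeled as a map from doc ID to score, and MUST scores are simply summed, which ignores every other factor of real scoring.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy model of MUST vs. FILTER; real scoring involves more than a sum.
class FilterClauseSketch {

    /**
     * Returns docs matching both clauses. When otherIsFilter is true the second
     * clause only restricts the result and its score is ignored.
     */
    static Map<Integer, Float> conjunction(Map<Integer, Float> scored,
                                           Map<Integer, Float> other,
                                           boolean otherIsFilter) {
        Map<Integer, Float> out = new LinkedHashMap<>();
        for (Map.Entry<Integer, Float> e : scored.entrySet()) {
            Float otherScore = other.get(e.getKey());
            if (otherScore == null) {
                continue; // the doc must match both clauses
            }
            out.put(e.getKey(), otherIsFilter ? e.getValue() : e.getValue() + otherScore);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, Float> scored = new HashMap<>();
        scored.put(1, 1.0f);
        scored.put(2, 2.0f);
        Map<Integer, Float> other = new HashMap<>();
        other.put(2, 0.5f);
        other.put(3, 4.0f);
        System.out.println(conjunction(scored, other, false)); // prints {2=2.5}
        System.out.println(conjunction(scored, other, true));  // prints {2=2.0}
    }
}
```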

New in Apache Lucene 5.0.0 (Feb 24, 2015)

  • NEW FEATURES:
  • LUCENE-5945: All file handling converted to NIO.2 apis. (Robert Muir)
  • LUCENE-5946: SimpleFSDirectory now uses Files.newByteChannel, for portability with custom FileSystemProviders. If you want the old non-interruptible behavior of RandomAccessFile, use RAFDirectory in the misc/ module. (Uwe Schindler, Robert Muir)
  • SOLR-3359: Added analyzer attribute/property to SynonymFilterFactory. (Ryo Onodera via Koji Sekiguchi)
  • LUCENE-5648: Index and search date ranges, particularly multi-valued ones. It's implemented in the spatial module as DateRangePrefixTree used with NumberRangePrefixTreeStrategy. (David Smiley)
  • LUCENE-5895: Lucene now stores a unique id per-segment and per-commit to aid in accurate replication of index files. (Robert Muir, Mike McCandless)
  • LUCENE-5889: Add commit method to AnalyzingInfixSuggester, and allow just using .add to build up the suggester. (Varun Thacker via Mike McCandless)
  • LUCENE-5123: Add a "pull" option to the postings writing API, so that a PostingsFormat now receives a Fields instance and it is responsible for iterating through all fields, terms, documents and positions. (Robert Muir, Mike McCandless)
  • LUCENE-5268: Full cutover of all postings formats to the "pull" FieldsConsumer API, removing PushFieldsConsumer. Added new PushPostingsWriterBase for single-pass push of docs/positions to the postings format. (Mike McCandless)
  • LUCENE-5906: Use Files.delete everywhere instead of File.delete, so that when things go wrong, you get a real exception message why. (Uwe Schindler, Robert Muir)
  • LUCENE-5933: Added FilterSpans for easier wrapping of Spans instance. (Shai Erera)
  • LUCENE-5925: Remove fallback logic from opening commits, instead use Directory.renameFile so that in-progress commits are never visible. (Robert Muir)
  • LUCENE-5820: SuggestStopFilter should have a factory. (Varun Thacker via Steve Rowe)
  • LUCENE-5949: Add Accountable.getChildResources(). (Robert Muir)
  • SOLR-5986: Added ExitableDirectoryReader that extends FilterDirectoryReader and enables exiting requests that take too long to enumerate over terms. (Anshum Gupta, Steve Rowe, Robert Muir)
  • LUCENE-5911: Add MemoryIndex.freeze() to allow thread-safe searching over a MemoryIndex. (Alan Woodward, David Smiley, Robert Muir)
  • LUCENE-5969: Lucene 5.0 has a new index format with mismatched file detection, improved exception handling, and indirect norms encoding for sparse fields. (Mike McCandless, Ryan Ernst, Robert Muir)
  • LUCENE-6053: Add Serbian analyzer. (Nikola Smolenski via Robert Muir, Mike McCandless)
  • LUCENE-4400: Add support for new NYSIIS Apache commons phonetic codec. (Thomas Neidhart via Mike McCandless)
  • LUCENE-6059: Add Daitch-Mokotoff Soundex phonetic Apache commons phonetic codec, and upgrade to Apache commons codec 1.10. (Thomas Neidhart via Mike McCandless)
  • LUCENE-6058: With the upgrade to Apache commons codec 1.10, the experimental BeiderMorseFilter has changed its behavior, so any index using it will need to be rebuilt. (Thomas Neidhart via Mike McCandless)
  • LUCENE-6050: Accept MUST and MUST_NOT (in addition to SHOULD) for each context passed to Analyzing/BlendedInfixSuggester. (Arcadius Ahouansou, jane chang via Mike McCandless)
  • LUCENE-5929: Also extract terms to highlight from block join queries. (Julie Tibshirani via Mike McCandless)
  • LUCENE-6063: Allow overriding whether/how ConcurrentMergeScheduler stalls incoming threads when merges are falling behind. (Mike McCandless)
  • LUCENE-5833: DocumentDictionary now enumerates each value separately in a multi-valued field (not just the first value), so you can build suggesters from multi-valued fields. (Varun Thacker via Mike McCandless)
  • LUCENE-6077: Added a filter cache. (Adrien Grand, Robert Muir)
  • LUCENE-6088: TermsFilter implements Accountable. (Adrien Grand)
  • LUCENE-6034: The default highlighter when used with QueryScorer will highlight payload-sensitive queries provided that term vectors with positions, offsets, and payloads are present. This is the only highlighter that can highlight such queries accurately. (David Smiley)
  • LUCENE-5914: Add an option to Lucene50Codec to support either BEST_SPEED or BEST_COMPRESSION for stored fields. (Adrien Grand, Robert Muir)
  • LUCENE-6119: Add auto-IO-throttling to ConcurrentMergeScheduler, to rate limit IO writes for each merge depending on incoming merge rate. (Mike McCandless)
  • LUCENE-6155: Add payload support to MemoryIndex. The default highlighter's QueryScorer and WeightedSpanTermExtractor now have setUsePayloads(bool). (David Smiley)
  • LUCENE-6166: Deletions (alone) can now trigger new merges. (Mike McCandless)
  • LUCENE-6177: Add CustomAnalyzer that allows configuring analyzers as you do in Solr's index schema. This class has a builder API to configure Tokenizers, TokenFilters, and CharFilters based on their SPI names and parameters as documented by the corresponding factories. (Uwe Schindler)
  • OPTIMIZATIONS:
  • LUCENE-5960: Use a more efficient bitset, not a Set, to track visited states. (Markus Heiden via Mike McCandless)
  • LUCENE-5959: Don't allocate excess memory when building automaton in finish. (Markus Heiden via Mike McCandless)
  • LUCENE-5963: Reduce memory allocations in AnalyzingSuggester. (Markus Heiden via Mike McCandless)
  • LUCENE-5938: MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE is now faster on queries that match few documents by using a sparse bit set implementation. (Adrien Grand)
  • LUCENE-5969: Refactor merging to be more efficient, checksum calculation is per-segment/per-producer, and norms and doc values merging no longer cause RAM spikes for latent fields. (Mike McCandless, Robert Muir)
  • LUCENE-5983: CachingWrapperFilter now uses a new DocIdSet implementation called RoaringDocIdSet instead of WAH8DocIdSet. (Adrien Grand)
  • LUCENE-6022: DocValuesDocIdSet checks live docs before doc values. (Adrien Grand)
  • LUCENE-6030: Add norms patched compression for a small number of common values. (Ryan Ernst)
  • LUCENE-6040: Speed up EliasFanoDocIdSet through broadword bit selection. (Paul Elschot)
  • LUCENE-6033: CachingTokenFilter now uses ArrayList not LinkedList, and has new isCached() method. (David Smiley)
  • LUCENE-6031: TokenSources (in the default highlighter) converts term vectors into a TokenStream much faster, in linear time (not N*log(N)), using less memory, and with reset() implemented. Only one of offsets or positions is required of the term vector. (David Smiley)
  • LUCENE-6089, LUCENE-6090: Tune CompressionMode.HIGH_COMPRESSION for better compression and less cpu usage. (Adrien Grand, Robert Muir)
  • LUCENE-6034: QueryScorer, used by the default highlighter, needn't re-index the provided TokenStream with MemoryIndex when it comes from TokenSources (term vectors) with offsets and positions. (David Smiley)
  • LUCENE-5951: ConcurrentMergeScheduler detects whether the index is on SSD or not and does a better job defaulting its settings. This only works on Linux for now; other operating systems will continue to use the previous defaults (tuned for spinning disks). (Robert Muir, Uwe Schindler, hossman, Mike McCandless)
  • LUCENE-6131: Optimize SortingMergePolicy. (Robert Muir)
  • LUCENE-6133: Improve default StoredFieldsWriter.merge() to be more efficient. (Robert Muir)
  • LUCENE-6145: Make EarlyTerminatingSortingCollector able to early-terminate when the sort order is a prefix of the index-time order. (Adrien Grand)
  • LUCENE-6178: Score boolean queries containing MUST_NOT clauses with BooleanScorer2, to use skip list data and avoid unnecessary scoring. (Adrien Grand, Robert Muir)
  • API CHANGES:
  • LUCENE-5900: Deprecated more constructors taking Version in *InfixSuggester and ICUCollationKeyAnalyzer, and removed TEST_VERSION_CURRENT from the test framework. (Ryan Ernst)
  • LUCENE-4535: oal.util.FilterIterator is now an internal API. (Adrien Grand)
  • LUCENE-4924: DocIdSetIterator.docID() must now return -1 when the iterator is not positioned. This change affects all classes that inherit from DocIdSetIterator, including DocsEnum and DocsAndPositionsEnum. (Adrien Grand)
  • LUCENE-5127: Reduce RAM usage of FixedGapTermsIndex. Remove IndexWriterConfig.setTermIndexInterval, IndexWriterConfig.setReaderTermsIndexDivisor, and termsIndexDivisor from StandardDirectoryReader. These options have been no-ops with the default codec since Lucene 4.0. If you want to configure the interval for this term index, pass it directly in your codec, where it can also be configured per-field. (Robert Muir)
  • LUCENE-5388: Remove Reader from Tokenizer's constructor and from Analyzer's createComponents. TokenStreams now always get their input via setReader. (Benson Margulies via Robert Muir - pull request #16)
  • LUCENE-5527: The Collector API has been refactored to use a dedicated Collector per leaf. (Shikhar Bhushan, Adrien Grand)
  • LUCENE-5702: The FieldComparator API has been refactored to a per-leaf API, just like Collectors. (Adrien Grand)
  • LUCENE-4246: IndexWriter.close now always closes, even if it throws an exception. The new IndexWriterConfig.setCommitOnClose (default true) determines whether close() should commit before closing.
  • LUCENE-5608, LUCENE-5565: Refactor SpatialPrefixTree/Cell API. Doesn't use Strings as tokens anymore, and now iterates cells on-demand during indexing instead of building a collection. RPT now has more setters. (David Smiley)
  • LUCENE-5666: Change uninverted access (sorting, faceting, grouping, etc) to use the DocValues API instead of FieldCache. For FieldCache functionality, use UninvertingReader in lucene/misc (or implement your own FilterReader). UninvertingReader is more efficient: supports multi-valued numeric fields, detects when a multi-valued field is single-valued, reuses caches of compatible types (e.g. SORTED also supports BINARY and SORTED_SET access without insanity). "Insanity" is no longer possible unless you explicitly want it. Rename FieldCache* and DocTermOrds* classes in the search package to DocValues*. Move SortedSetSortField to core and add SortedSetFieldSource to queries/, which takes the same selectors. Add helper methods to DocValues.java that are better suited for search code (never return null, etc). (Mike McCandless, Robert Muir)
  • LUCENE-5871: Remove Version from IndexWriterConfig. Use IndexWriterConfig.setCommitOnClose to change the behavior of IndexWriter.close(). The default has been changed to match that of 4.x. (Ryan Ernst, Mike McCandless)
  • LUCENE-5965: CorruptIndexException requires a String or DataInput resource. (Robert Muir)
  • LUCENE-5972: IndexFormatTooOldException and IndexFormatTooNewException now extend from IOException. (Ryan Ernst, Robert Muir)
  • LUCENE-5569: *AtomicReader/AtomicReaderContext have been renamed to *LeafReader/LeafReaderContext. (Ryan Ernst)
  • LUCENE-5938: Removed MultiTermQuery.ConstantScoreAutoRewrite as MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE is usually better. (Adrien Grand)
  • LUCENE-5924: Rename CheckIndex -fix option to -exorcise. This option does not actually fix the index, it just drops data. (Robert Muir)
  • LUCENE-5969: Add Codec.compoundFormat, which handles the encoding of compound files. Add getMergeInstance() to codec producer APIs, which can be overridden to return an instance optimized for merging instead of searching. Add Terms.getStats() which can return additional codec-specific statistics about a field. Change instance method SegmentInfos.read() to two static methods: SegmentInfos.readCommit() and SegmentInfos.readLatestCommit(). (Mike McCandless, Robert Muir)
  • LUCENE-5992: Remove FieldInfos from SegmentInfosWriter.write API. (Robert Muir, Mike McCandless)
  • LUCENE-5998: Simplify Field/SegmentInfoFormat to read+write methods. (Robert Muir)
  • LUCENE-6000: Removed StandardTokenizerInterface. Tokenizers now use their jflex impl directly. (Ryan Ernst)
  • LUCENE-6006: Removed FieldInfo.normType since it's redundant: it will be DocValuesType.NUMERIC if the field is indexed and does not omit norms, else null. (Robert Muir, Mike McCandless)
  • LUCENE-6013: Removed indexed boolean from IndexableFieldType and FieldInfo, since it's redundant with IndexOptions != null. (Robert Muir, Mike McCandless)
  • LUCENE-6021: FixedBitSet.nextSetBit now returns DocIdSetIterator.NO_MORE_DOCS instead of -1 when there are no more bits which are set. (Adrien Grand)
  • LUCENE-5953: Directory and LockFactory APIs were restructured: Locking is now under the responsibility of the Directory implementation. LockFactory is only used by subclasses of BaseDirectory to delegate locking to an impl class. LockFactories are now singletons and are responsible for creating a Lock instance based on a Directory implementation passed to the factory method. See MIGRATE.txt for more details. (Uwe Schindler, Robert Muir)
  • LUCENE-6062: Throw exception instead of silently doing nothing if you try to sort/group/etc on a misconfigured field (e.g. no docvalues, no UninvertingReader, etc). (Robert Muir)
  • LUCENE-6068: LeafReader.fields() never returns null. (Robert Muir)
  • LUCENE-6082: Remove abort() from codec apis. (Robert Muir)
  • LUCENE-6084: IndexOutput's constructor now requires a String resourceDescription so its toString is sane. (Robert Muir, Mike McCandless)
  • LUCENE-6087: Allow passing custom DirectoryReader to SearcherManager. (Mike McCandless)
  • LUCENE-6085: Undeprecate SegmentInfo attributes, but add safety so they won't be trappy if codec tries to use them during docvalues updates. (Robert Muir)
  • LUCENE-6097: Remove dangerous / overly expert IndexWriter.abortMerges and waitForMerges methods. (Robert Muir, Mike McCandless)
  • LUCENE-6099: Add FilterDirectory.unwrap and FilterDirectoryReader.unwrap. (Simon Willnauer, Mike McCandless)
  • LUCENE-6121: CachingTokenFilter.reset() now propagates to its input if called before incrementToken(). You must call reset() now on this filter instead of doing it a-priori on the input(), which previously didn't work. (David Smiley, Robert Muir)
  • LUCENE-6147: Make the core Accountables.namedAccountable function public. (Ryan Ernst)
  • LUCENE-6150: Remove staleFiles set and onIndexOutputClosed() from FSDirectory. (Uwe Schindler, Robert Muir, Mike McCandless)
  • LUCENE-6146: Replaced Directory.copy() with Directory.copyFrom(). (Robert Muir)
  • LUCENE-6149: Infix suggesters' highlighting and allTermsRequired can be set at the constructor for non-contextual lookup. (Boon Low, Tomás Fernández Löbbe)
  • LUCENE-6158, LUCENE-6165: IndexWriter.addIndexes(IndexReader...) changed to addIndexes(CodecReader...). (Robert Muir)
  • LUCENE-6179: Out-of-order scoring is not allowed anymore, so Weight.scoresDocsOutOfOrder and LeafCollector.acceptsDocsOutOfOrder have been removed and boolean queries now always score in order.
  • LUCENE-6212: IndexWriter no longer accepts per-document Analyzer to add/updateDocument. These methods were trappy as they made it easy to accidentally index tokens that were not easily searchable. (Mike McCandless)
  • BUG FIXES:
  • LUCENE-5650: Enforce read-only access to any path outside the temporary folder via security manager, and make test temp dirs absolute. (Ryan Ernst, Dawid Weiss)
  • LUCENE-5948: RateLimiter now fully inits itself on init. (Varun Thacker via Mike McCandless)
  • LUCENE-5981: CheckIndex obtains write.lock, since with some parameters it may modify the index, and to prevent false corruption reports, as it does not have the regular "spinlock" of DirectoryReader.open. It now implements Closeable and you must close it to release the lock. (Mike McCandless, Robert Muir)
  • LUCENE-6004: Don't highlight the LookupResult.key returned from AnalyzingInfixSuggester. (Christian Reuschling, jane chang via Mike McCandless)
  • LUCENE-5980: Don't let document length overflow. (Robert Muir)
  • LUCENE-5999: Fix backcompat support for StandardTokenizer. (Ryan Ernst)
  • LUCENE-5961: Fix the exists() method for FunctionValues returned by many ValueSources to behave properly when wrapping other ValueSources which do not exist for the specified document. (hossman)
  • LUCENE-6039: Add IndexOptions.NONE and DocValuesType.NONE instead of using null to mean not indexed and no doc values, renamed IndexOptions.DOCS_ONLY to DOCS, and pulled IndexOptions and DocValues out of FieldInfo into their own classes in org.apache.lucene.index. (Simon Willnauer, Robert Muir, Mike McCandless)
  • LUCENE-6043: Fix backcompat support for UAX29URLEmailTokenizer. (Ryan Ernst)
  • LUCENE-6041: Remove sugar methods FieldInfo.isIndexed and FieldInfo.hasDocValues. (Robert Muir, Mike McCandless)
  • LUCENE-6044: Fix backcompat support for token filters with enablePositionIncrements=false. Also fixed backcompat for TrimFilter with updateOffsets=true. These options are supported with a match version before 4.4, and no longer valid at all with 5.0. (Ryan Ernst)
  • LUCENE-6042: CustomScoreQuery explain was incorrect in some cases, such as when nested inside a boolean query. (Denis Lantsman via Robert Muir)
  • LUCENE-6046: Add maxDeterminizedStates safety to determinize (which has an exponential worst case) so that if it would create too many states, it now throws an exception instead of exhausting CPU/RAM. (Nik Everett via Mike McCandless)
  • LUCENE-6054: Allow repeating the empty automaton. (Nik Everett via Mike McCandless)
  • LUCENE-6049: Don't throw cryptic exception writing a segment when the only docs in it had fields that hit non-aborting exceptions during indexing but also had doc values. (Mike McCandless)
  • LUCENE-6055: PayloadAttribute.clone() now does a deep clone of the underlying bytes. (Shai Erera)
  • LUCENE-6060: Remove dangerous IndexWriter.unlock method. (Simon Willnauer, Mike McCandless)
  • LUCENE-6062: Pass correct fieldinfos to docvalues producer when the segment has updates. (Mike McCandless, Shai Erera, Robert Muir)
  • LUCENE-6075: Don't overflow int in SimpleRateLimiter. (Boaz Leskes via Mike McCandless)
  • LUCENE-5987: IndexWriter will now forcefully close itself on aborting exception (an exception that would otherwise cause silent data loss). (Robert Muir, Mike McCandless)
  • LUCENE-6094: Allow IW.rollback to stop ConcurrentMergeScheduler even when it's stalling because there are too many merges. (Mike McCandless)
  • LUCENE-6105: Don't cache FST root arcs if the number of root arcs is small, or if the cache would be > 20% of the size of the FST. (Robert Muir, Mike McCandless)
  • LUCENE-6124: Fix double-close() problems in codec and store APIs. (Robert Muir)
  • LUCENE-6152: Fix double close problems in OutputStreamIndexOutput. (Uwe Schindler)
  • LUCENE-6139: Highlighter: TokenGroup start & end offset getters should have been returning the offsets of just the matching tokens in the group when there's a distinction. (David Smiley)
  • LUCENE-6173: NumericTermAttribute and spatial/CellTokenStream do not clone their BytesRef(Builder)s. Also equals/hashCode was missing. (Uwe Schindler)
  • LUCENE-6205: Fixed intermittent concurrency issue that could cause FileNotFoundException when writing doc values updates at the same time that a merge kicks off. (Mike McCandless)
  • LUCENE-6192: Fix int overflow corruption case in skip data for high frequency terms in extremely large indices. (Robert Muir, Mike McCandless)
  • LUCENE-6093: Don't throw NullPointerException from BlendedInfixSuggester for lookups that do not end in a prefix token. (jane chang via Mike McCandless)
  • LUCENE-6214: Fixed IndexWriter deadlock when one thread is committing while another opens a near-real-time reader and an unrecoverable (tragic) exception is hit. (Simon Willnauer, Mike McCandless)
  • DOCUMENTATION:
  • LUCENE-5392: Add/improve analysis package documentation to reflect analysis API changes. (Benson Margulies via Robert Muir - pull request #17)
  • LUCENE-6057: Improve Sort(SortField) docs. (Martin Braun via Mike McCandless)
  • LUCENE-6112: Fix compile error in FST package example code. (Tomoko Uchida via Koji Sekiguchi)
  • TESTS:
  • LUCENE-5957: Add option for tests to not randomize codec. (Ryan Ernst)
  • LUCENE-5974: Add check that backcompat indexes use default codecs. (Ryan Ernst)
  • LUCENE-5971: Create addBackcompatIndexes.py script to build and add backcompat test indexes for a given lucene version. Also renamed backcompat index files to use Version.toString() in filename. (Ryan Ernst)
  • LUCENE-6002: Monster tests no longer fail. Most of them now have an 80 hour timeout, effectively removing the timeout. The tests that operate near the 2 billion limit now use IndexWriter.MAX_DOCS instead of Integer.MAX_VALUE. Some of the slow Monster tests now explicitly choose the default codec. (Mike McCandless, Shawn Heisey)
  • LUCENE-5968: Improve error message when 'ant beast' is run on top-level modules. (Ramkumar Aiyengar, Uwe Schindler)
  • LUCENE-6120: Fix MockDirectoryWrapper's close() handling. (Mike McCandless, Robert Muir)
  • BUILD:
  • LUCENE-5909: Smoke tester now has better command line parsing and optionally also runs on Java 8. (Ryan Ernst, Uwe Schindler)
  • LUCENE-5902: Add bumpVersion.py script to manage version increase after release branch is cut.
  • LUCENE-5962: Rename diffSources.py to createPatch.py and make it work with all text file types. (Ryan Ernst)
  • LUCENE-5995: Upgrade ICU to 54.1. (Robert Muir)
  • LUCENE-6070: Upgrade forbidden-apis to 1.7. (Uwe Schindler)
  • OTHER:
  • LUCENE-5563: Removed sep layout, which has fallen behind on features and doesn't perform as well as other options. (Robert Muir)
  • LUCENE-4086: Removed support for Lucene 3.x indexes. See migration guide for more information. (Robert Muir)
  • LUCENE-5858: Moved Lucene 4 compatibility codecs to 'lucene-backward-codecs.jar'. (Adrien Grand, Robert Muir)
  • LUCENE-5915: Remove Pulsing postings format. (Robert Muir)
  • LUCENE-6213: Add useful exception message when commit contains segments from legacy codecs. (Ryan Ernst)
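
Example for LUCENE-6021: loops that previously tested the result of nextSetBit against -1 need to compare against DocIdSetIterator.NO_MORE_DOCS instead. The following is a minimal sketch of the new loop shape in plain Java, not Lucene code: the class NextSetBitSketch is hypothetical, java.util.BitSet stands in for FixedBitSet, and the wrapper only models the changed sentinel value, not any other behavior of the real class.

```java
import java.util.BitSet;

// Hypothetical wrapper that mimics the new sentinel on top of java.util.BitSet.
class NextSetBitSketch {

    /** Same value as DocIdSetIterator.NO_MORE_DOCS. */
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Returns the next set bit at or after 'from', or NO_MORE_DOCS instead of -1. */
    static int nextSetBit(BitSet bits, int from) {
        int next = bits.nextSetBit(from);
        return next == -1 ? NO_MORE_DOCS : next;
    }

    /** Iterates with the new termination test (doc != NO_MORE_DOCS). */
    static int countSetBits(BitSet bits) {
        int count = 0;
        for (int doc = nextSetBit(bits, 0); doc != NO_MORE_DOCS; doc = nextSetBit(bits, doc + 1)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(2);
        bits.set(5);
        System.out.println(countSetBits(bits)); // prints 2
    }
}
```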

New in Apache Lucene 4.10.3 (Dec 24, 2014)

  • Bug fixes:
  • LUCENE-6046: Add maxDeterminizedStates safety to determinize (which has an exponential worst case) so that if it would create too many states, it now throws an exception instead of exhausting CPU/RAM.
  • LUCENE-6054: Allow repeating the empty automaton
  • LUCENE-6049: Don't throw cryptic exception writing a segment when the only docs in it had fields that hit non-aborting exceptions during indexing but also had doc values.
  • LUCENE-6060: Deprecate IndexWriter.unlock
  • LUCENE-3229: Overlapping ordered SpanNearQuery spans should not match.
  • LUCENE-6004: Don't highlight the LookupResult.key returned from AnalyzingInfixSuggester
  • LUCENE-6075: Don't overflow int in SimpleRateLimiter
  • LUCENE-5980: Don't let document length overflow.
  • LUCENE-6042: CustomScoreQuery explain was incorrect in some cases, such as when nested inside a boolean query.
  • LUCENE-5948: RateLimiter now fully inits itself on init.
  • LUCENE-6055: PayloadAttribute.clone() now does a deep clone of the underlying bytes.
  • LUCENE-6094: Allow IW.rollback to stop ConcurrentMergeScheduler even when it's stalling because there are too many merges.
  • Documentation:
  • LUCENE-6057: Improve Sort(SortField) docs
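
The LUCENE-6055 entry above turns on the difference between a shallow and a deep clone of a byte buffer. The following is a minimal, self-contained sketch of that distinction only. The `Payload` class and its method names are hypothetical and are not Lucene's PayloadAttribute implementation.

```java
import java.util.Arrays;

// Hypothetical holder for a byte payload; illustrative only, not Lucene code.
class PayloadCloneSketch {
    static final class Payload {
        final byte[] bytes;

        Payload(byte[] bytes) { this.bytes = bytes; }

        // Shallow copy: the new object shares the same backing array.
        Payload shallowCopy() { return new Payload(bytes); }

        // Deep copy: the new object gets its own copy of the bytes.
        Payload deepCopy() { return new Payload(Arrays.copyOf(bytes, bytes.length)); }
    }

    public static void main(String[] args) {
        Payload original = new Payload(new byte[] {1, 2, 3});
        Payload shallow = original.shallowCopy();
        Payload deep = original.deepCopy();

        original.bytes[0] = 42; // mutate the original after cloning

        System.out.println(shallow.bytes[0]); // shares the array, so it sees the change
        System.out.println(deep.bytes[0]);    // independent copy, so it is unchanged
    }
}
```

With a deep copy, later changes to the source bytes cannot leak into a clone that a consumer has already stored.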

New in Apache Lucene 4.10.2 (Nov 10, 2014)

  • Bug fixes:
  • LUCENE-5977: Fix tokenstream safety checks in IndexWriter to properly work across multi-valued fields. Previously some cases across multi-valued fields would happily create a corrupt index.
  • LUCENE-6019: Detect when DocValuesType illegally changes for the same field name. Also added -Dtests.asserts=true|false so we can run tests with and without assertions.

New in Apache Lucene 4.10.1 (Sep 30, 2014)

  • Bug fixes:
  • LUCENE-5934: Fix backwards compatibility for 4.0 indexes.
  • LUCENE-5939: Regenerate old backcompat indexes to ensure they were built with the exact release
  • LUCENE-5952: Improve error messages when version cannot be parsed; don't check for too old or too new major version (it's too low level to enforce here); use simple string tokenizer.
  • LUCENE-5958: Don't let exceptions during checkpoint corrupt the index. Refactor existing OOM handling too, so you don't need to handle OOM specially for every IndexWriter method; instead, such disasters will cause IW to close itself defensively.
  • LUCENE-5904: Fixed a corruption case that can happen when 1) IndexWriter is shut down uncleanly (OS crash, power loss, etc.), 2) on startup, when a new IndexWriter is created, a virus checker is holding some of the previously written but unused files open and preventing deletion, and 3) IndexWriter writes these files again during the course of indexing; the files can then later be deleted, causing corruption. This case was detected by adding evilness to MockDirectoryWrapper to have it simulate a virus checker holding a file open and preventing deletion.
  • LUCENE-5916: Static scope test components should be consistent between tests (and test iterations). Fix for FaultyIndexInput in particular.
  • LUCENE-5975: Fix reading of 3.0-3.3 indexes, where bugs in these old index formats would result in CorruptIndexException "did not read all bytes from file" when reading the deleted docs file.
  • Tests:
  • LUCENE-5936: Add backcompat checks to verify what is tested matches known versions

New in Apache Lucene 4.10.0 (Sep 4, 2014)

  • New Features:
  • LUCENE-5778: Support hunspell morphological description fields/aliases. (Robert Muir)
  • LUCENE-5801: Added (back) OrdinalMappingAtomicReader for merging search indexes that contain category ordinals from separate taxonomy indexes. (Nicola Buso via Shai Erera)
  • LUCENE-4175, LUCENE-5714, LUCENE-5779: Index and search rectangles with spatial BBoxSpatialStrategy using most predicates. Sort documents by relative overlap of query areas or just by indexed shape area. (Ryan McKinley, David Smiley)
  • LUCENE-5806: Extend expressions grammar to support array access in variables. Added helper class VariableContext to parse complex variable into pieces. (Ryan Ernst)
  • LUCENE-5826: Support proper hunspell case handling, LANG, KEEPCASE, NEEDAFFIX, and ONLYINCOMPOUND flags. (Robert Muir)
  • LUCENE-5815: Add TermAutomatonQuery, a proximity query allowing you to create an arbitrary automaton, using terms on the transitions, expressing which sequence of sequential terms (including a special "any" term) are allowed. This is a generalization of MultiPhraseQuery and span queries, and enables "correct" (including position) length search-time graph synonyms. (Mike McCandless)
  • LUCENE-5819: Add OrdsLucene41 block tree terms dict and postings format, to include term ordinals in the index so the optional TermsEnum.ord() and TermsEnum.seekExact(long ord) APIs work. (Mike McCandless)
  • LUCENE-5835: TermValComparator can sort missing values last. (Adrien Grand)
  • LUCENE-5825: Benchmark module can use custom postings format, e.g.: codec.postingsFormat=Memory. (Varun Shenoy, David Smiley)
  • LUCENE-5842: When opening large files (where it's too expensive to compare the checksum against all the bytes), retrieve the checksum to validate the structure of the footer; this can detect some forms of corruption such as truncation. (Robert Muir)
  • LUCENE-5739: Added DataInput.readZ(Int|Long) and DataOutput.writeZ(Int|Long) to read and write small signed integers. (Adrien Grand)
  • API Changes:
  • LUCENE-5752: Simplified Automaton API to be immutable. (Mike McCandless)
  • LUCENE-5793: Add equals/hashCode to FieldType. (Shay Banon, Robert Muir)
  • LUCENE-5692: DisjointSpatialFilter is deprecated (used by RecursivePrefixTreeStrategy). (David Smiley)
  • LUCENE-5771: SpatialOperation's predicate names are now aliased to OGC standard names. Thus you can use: Disjoint, Equals, Intersects, Overlaps, Within, Contains, Covers, CoveredBy. The area requirement on the predicates was removed, and Overlaps' definition was fixed. (David Smiley)
  • LUCENE-5850: Made Version handling more robust and extensible. Deprecated Constants.LUCENE_MAIN_VERSION, Constants.LUCENE_VERSION and current Version constants of the form LUCENE_X_Y. Added version constants that include bugfix number of form LUCENE_X_Y_Z. Changed Version.LUCENE_CURRENT to Version.LATEST. CheckIndex now prints the Lucene version used to write each segment. (Ryan Ernst, Uwe Schindler, Robert Muir, Mike McCandless)
  • LUCENE-5836: BytesRef has been split into BytesRef, whose intended usage is to be just a reference to a section of a larger byte[], and BytesRefBuilder, which is a StringBuilder-like class for BytesRef instances. (Adrien Grand)
  • LUCENE-5883: You can now change the MergePolicy instance on a live IndexWriter, without first closing and reopening the writer. This allows you to, e.g., run a special merge with UpgradeIndexMergePolicy without reopening the writer. Also, MergePolicy no longer implements Closeable; if you need to release your custom MergePolicy's resources, you need to implement close() and call it explicitly. (Shai Erera)
  • LUCENE-5859: Deprecate Analyzer constructors taking Version. Use Analyzer.setVersion() to set the version an analyzer should use to replicate behavior from a specific release. (Ryan Ernst, Robert Muir)
  • Optimizations:
  • LUCENE-5780: Make OrdinalMap more memory-efficient, especially in case the first segment has all values. (Adrien Grand, Robert Muir)
  • LUCENE-5782: OrdinalMap now sorts enums before being built in order to improve compression. (Adrien Grand)
  • LUCENE-5798: Optimize MultiDocsEnum reuse. (Robert Muir)
  • LUCENE-5799: Optimize numeric docvalues merging. (Robert Muir)
  • LUCENE-5797: Optimize norms merging. (Adrien Grand, Robert Muir)
  • LUCENE-5803: Add DelegatingAnalyzerWrapper, an optimized variant of AnalyzerWrapper that doesn't allow wrapping components or readers. This wrapper class is the base class of all analyzers that just delegate to another analyzer, e.g. per field name: PerFieldAnalyzerWrapper and Solr's schema support. (Shay Banon, Uwe Schindler, Robert Muir)
  • LUCENE-5795: MoreLikeThisQuery now only collects the top N terms instead of collecting all terms from the like text when building the query. (Alex Ksikes, Simon Willnauer)
  • LUCENE-5681: Fix RAMDirectory's IndexInput to not do double buffering on slices (causes useless data copying, especially on random access slices). This also improves slices of NRTCachingDirectory, because the cache is based on RAMDirectory. BufferedIndexInput.wrap() was marked with a warning in javadocs. It is almost always a better idea to implement slicing on your own. (Uwe Schindler, Robert Muir)
  • LUCENE-5834: Empty sorted set and numeric doc values are now singletons. (Adrien Grand)
  • LUCENE-5841: Improve performance of block tree terms dictionary when assigning terms to blocks. (Mike McCandless)
  • LUCENE-5856: Optimize Fixed/Open/LongBitSet to remove unnecessary AND. (Robert Muir)
  • LUCENE-5884: Optimize FST.ramBytesUsed. (Adrien Grand, Robert Muir, Mike McCandless)
  • LUCENE-5882: Add Lucene410DocValuesFormat, with faster term lookups for SORTED/SORTED_SET fields. (Robert Muir)
  • LUCENE-5887: Remove WeakIdentityMap caching in AttributeFactory, AttributeSource, and VirtualMethod in favour of Java 7's ClassValue. Always use MethodHandles to create AttributeImpl classes. (Uwe Schindler)
  • Bug Fixes:
  • LUCENE-5796: Fixes the Scorer.getChildren() method for two combinations of BooleanQuery. (Terry Smith via Robert Muir)
  • LUCENE-5790: Fix compareTo in MutableValueDouble and MutableValueBool; this caused incorrect results when grouping on fields with missing values. (海老澤 志信, hossman)
  • LUCENE-5817: Fix hunspell zero-affix handling: previously only zero-strips worked correctly. (Robert Muir)
  • LUCENE-5818, LUCENE-5823: Fix hunspell overgeneration for short strings that also match affixes; words are only stripped to a zero-length string if the FULLSTRIP option is specified in the dictionary. (Robert Muir)
  • LUCENE-5824: Fix hunspell 'long' flag handling. (Robert Muir)
  • LUCENE-5827: Make all Directory implementations correctly fail with IllegalArgumentException if slices are out of bounds. (Uwe Schindler)
  • LUCENE-5838: Fix hunspell when the .aff file has over 64k affixes. (Robert Muir)
  • LUCENE-5844: ArrayUtil.grow/oversize now returns a maximum of Integer.MAX_VALUE - 8 for the maximum array size. (Robert Muir, Mike McCandless)
  • LUCENE-5843: Added IndexWriter.MAX_DOCS which is the maximum number of documents allowed in a single index, and any operations that add documents will now throw IllegalStateException if the max count would be exceeded, instead of silently creating an unusable index. (Mike McCandless)
  • LUCENE-5869: Added restriction to positive values for maxExpansions in FuzzyQuery. (Ryan Ernst)
  • LUCENE-5672: IndexWriter.addIndexes() calls maybeMerge(), to ensure the index stays healthy. If you don't want merging use NoMergePolicy instead. (Robert Muir)
  • LUCENE-5897, LUCENE-5400: JFlex-based tokenizers StandardTokenizer and UAX29URLEmailTokenizer tokenize extremely slowly over long sequences of text partially matching certain grammar rules. The scanner default buffer size was reduced, and scanner buffer growth was disabled, resulting in much, much faster tokenization for these text sequences. (Chris Geeringh, Robert Muir, Steve Rowe)
  • LUCENE-5907: Fix corruption case when opening a pre-4.x index with IndexWriter, then opening an NRT reader from that writer, then calling commit from the writer, then closing the NRT reader. This case would remove the wrong files from the index leading to a corrupt index. (Mike McCandless)
  • LUCENE-5908: Fix Lucene43NGramTokenizer to be final
  • Test Framework:
  • LUCENE-5786: Unflushed/truncated events file (hung testing subprocess). (Dawid Weiss)
  • LUCENE-5881: Add "beasting" of tests: repeats the whole "test" Ant target N times with "ant beast -Dbeast.iters=N". (Uwe Schindler, Robert Muir, Ryan Ernst, Dawid Weiss)
  • Build:
  • LUCENE-5770: Upgrade to JFlex 1.6, which has direct support for supplementary code points - as a result, ICU4J is no longer used to generate surrogate pairs to augment JFlex scanner specifications. (Steve Rowe)
  • SOLR-6358: Remove VcsDirectoryMappings from idea configuration vcs.xml. (Ramkumar Aiyengar via Steve Rowe)
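
The LUCENE-5739 entry above adds readZ(Int|Long) and writeZ(Int|Long) for small signed integers. Assuming the "Z" refers to zig-zag encoding layered on a 7-bits-per-byte variable-length format, the idea can be sketched as follows. This is a stand-alone illustration, not the Lucene DataInput/DataOutput code, and the class and method names are hypothetical.

```java
// Stand-alone sketch of zig-zag encoding; not the Lucene DataInput/DataOutput code.
class ZigZagSketch {
    // Interleaves negative and positive values so small magnitudes stay small:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static int encode(int v) {
        return (v >> 31) ^ (v << 1);
    }

    // Inverse of encode().
    static int decode(int z) {
        return (z >>> 1) ^ -(z & 1);
    }

    // Number of bytes a 7-bits-per-byte variable-length encoding needs for v,
    // treating v as unsigned (assumed format, for illustration).
    static int vIntLength(int v) {
        int n = 1;
        while ((v & ~0x7F) != 0) {
            v >>>= 7;
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        for (int v : new int[] {0, -1, 1, -2, 63, -64, 64}) {
            int z = encode(v);
            System.out.println(v + " -> " + z + " (" + vIntLength(z)
                + " byte(s)), decoded back to " + decode(z));
        }
        // Without zig-zag, -1 has all high bits set and needs 5 bytes.
        System.out.println(vIntLength(-1));
    }
}
```

The design point is that a plain variable-length format penalizes every negative value with the maximum length, while the zig-zag mapping keeps values near zero short regardless of sign.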

New in Apache Lucene 4.9.0 (Jun 26, 2014)

  • Changes in Runtime Behavior:
  • LUCENE-5611: Changing the term vector options for multiple field instances by the same name in one document is no longer accepted; IndexWriter will now throw IllegalArgumentException.
  • (Robert Muir, Mike McCandless)
  • LUCENE-5646: Remove rare/undertested bulk merge algorithm in CompressingStoredFieldsWriter.
  • (Robert Muir, Adrien Grand)
  • New Features:
  • LUCENE-5610: Add Terms.getMin and Terms.getMax to get the lowest and highest terms, and NumericUtils.get{Min/Max}{Int/Long} to get the minimum numeric values from the provided Terms.
  • (Robert Muir, Mike McCandless)
  • LUCENE-5675: Add IDVersionPostingsFormat, a postings format optimized for primary-key (ID) fields that also record a version (long) for each ID.
  • (Robert Muir, Mike McCandless)
  • LUCENE-5680: Add ability to atomically update a set of DocValues fields.
  • (Shai Erera)
  • LUCENE-5717: Add support for multiterm queries nested inside filtered and constant-score queries to postings highlighter.
  • (Luca Cavanna via Robert Muir)
  • LUCENE-5731, LUCENE-5760: Add RandomAccessInput, a random access API for directory. Add DirectReader/Writer, optimized for reading packed integers directly from Directory. Add Lucene49Codec and Lucene49DocValuesFormat that make use of these.
  • (Robert Muir)
  • LUCENE-5743: Add Lucene49NormsFormat, which can compress in some cases such as very short fields.
  • (Ryan Ernst, Adrien Grand, Robert Muir)
  • LUCENE-5748: Add SORTED_NUMERIC docvalues type, which is efficient for processing numeric fields with multiple values.
  • (Robert Muir)
  • LUCENE-5754: Allow "$" as part of variable and function names in expressions module.
  • (Uwe Schindler)
  • Changes in Backwards Compatibility Policy:
  • LUCENE-5634: Add reuse argument to IndexableField.tokenStream. This can be used by custom fieldtypes, which don't use the Analyzer, but implement their own TokenStream.
  • (Uwe Schindler, Robert Muir)
  • LUCENE-5640: AttributeSource.AttributeFactory was moved to a top-level class: org.apache.lucene.util.AttributeFactory
  • (Uwe Schindler, Robert Muir)
  • LUCENE-4371: Removed IndexInputSlicer and Directory.createSlicer() and replaced with IndexInput.slice().
  • (Robert Muir)
  • LUCENE-5727, LUCENE-5678: Remove IndexOutput.seek, IndexOutput.setLength().
  • (Robert Muir, Uwe Schindler)
  • API Changes:
  • LUCENE-5756: IndexWriter now implements Accountable and IW#ramSizeInBytes() has been deprecated in favor of IW#ramBytesUsed().
  • (Simon Willnauer)
  • LUCENE-5725: MoreLikeThis#like now accepts multiple values per field. The pre-existing method has been deprecated in favor of a variable-arguments method for the like text.
  • (Alex Ksikes via Simon Willnauer)
  • LUCENE-5711: MergePolicy accepts an IndexWriter instance on each method rather than holding state against a single IndexWriter instance.
  • (Simon Willnauer)
  • LUCENE-5582: Deprecate IndexOutput.length (just use IndexOutput.getFilePointer instead) and IndexOutput.setLength.
  • (Mike McCandless)
  • LUCENE-5621: Deprecate IndexOutput.flush: this is not used by Lucene.
  • (Robert Muir)
  • LUCENE-5611: Simplified Lucene's default indexing chain / APIs. AttributeSource/TokenStream.getAttribute now returns null if the attribute is not present (previously it threw IllegalArgumentException). StoredFieldsWriter.startDocument no longer receives the number of fields that will be added.
  • (Robert Muir, Mike McCandless)
  • LUCENE-5632: In preparation for coming Lucene versions, the Version enum constants were renamed to make them more readable. The constant for Lucene 4.9 is now "LUCENE_4_9". Version.parseLeniently() is still able to parse the old strings ("LUCENE_49"). The old identifiers are deprecated and will be removed in Lucene 5.0.
  • (Uwe Schindler, Robert Muir)
  • LUCENE-5633: Change NoMergePolicy to a singleton with no distinction between compound and non-compound types.
  • (Shai Erera)
  • LUCENE-5640: The Token class was deprecated. Since Lucene 2.9, TokenStreams are using Attributes, Token is no longer used.
  • (Uwe Schindler, Robert Muir)
  • LUCENE-5679: Consolidated IndexWriter.deleteDocuments(Term) and IndexWriter.deleteDocuments(Query) with their varargs counterparts.
  • (Shai Erera)
  • LUCENE-5706: Removed the option to unset a DocValues field through DocValues updates.
  • (Shai Erera)
  • LUCENE-5700: Added oal.util.Accountable that is now implemented by all classes whose memory usage can be estimated.
  • (Robert Muir, Adrien Grand)
  • LUCENE-5708: Remove IndexWriterConfig.clone, so now IndexWriter simply uses the IndexWriterConfig you pass it, and you must create a new IndexWriterConfig for each IndexWriter.
  • (Mike McCandless)
  • LUCENE-5701: Core closed listeners are now available in the AtomicReader API, they used to sit only in SegmentReader.
  • (Adrien Grand, Robert Muir)
  • LUCENE-5678: IndexOutput no longer allows seeking, so it is no longer required to use RandomAccessFile to write Indexes. Lucene now uses standard FileOutputStream wrapped with OutputStreamIndexOutput to write index data. BufferedIndexOutput was removed, because buffering and checksumming is provided by FilterOutputStreams, provided by the JDK.
  • (Uwe Schindler, Mike McCandless)
  • LUCENE-5703: BinaryDocValues API changed to work like TermsEnum and not allocate/copy bytes on each access; you are responsible for cloning if you want to keep data around.
  • (Adrien Grand)
  • LUCENE-5695: DocIdSet implements Accountable.
  • (Adrien Grand)
  • LUCENE-5757: Moved RamUsageEstimator's reflection-based processing to RamUsageTester in the test-framework module.
  • (Robert Muir)
  • LUCENE-5761: Removed DiskDocValuesFormat; it was very inefficient and saved very little RAM over the default codec.
  • (Robert Muir)
  • LUCENE-5775: Deprecate JaspellLookup.
  • (Mike McCandless)
  • Optimizations:
  • LUCENE-5603: hunspell stemmer more efficiently strips prefixes and suffixes.
  • (Robert Muir)
  • LUCENE-5599: HttpReplicator did not properly delegate bulk read() to wrapped InputStream.
  • (Christoph Kaser via Shai Erera)
  • LUCENE-5591: pass an IOContext with estimated flush size when applying DV updates.
  • (Shai Erera)
  • LUCENE-5634: IndexWriter reuses TokenStream instances for String and Numeric fields by default.
  • (Uwe Schindler, Shay Banon, Mike McCandless, Robert Muir)
  • LUCENE-5638, LUCENE-5640: TokenStream uses a more performant AttributeFactory by default, that packs the core attributes into one implementation (PackedTokenAttributeImpl), for faster clearAttributes(), saveState(), and restoreState(). In addition, AttributeFactory uses Java 7 MethodHandles for instantiating Attribute implementations.
  • (Uwe Schindler, Robert Muir)
  • LUCENE-5609: Changed the default NumericField precisionStep from 4 to 8 (for int/float) and 16 (for long/double), for faster indexing time and smaller indices.
  • (Robert Muir, Uwe Schindler, Mike McCandless)
  • LUCENE-5670: Add skip/FinalOutput to FST Outputs.
  • (Christian Ziech via Mike McCandless)
  • LUCENE-4236: Optimize BooleanQuery's in-order scoring. This speeds up some types of boolean queries.
  • (Robert Muir)
  • LUCENE-5694: Don't score() subscorers in DisjunctionSumScorer or DisjunctionMaxScorer unless score() is called.
  • (Robert Muir)
  • LUCENE-5720: Optimize DirectPackedReader's decompression.
  • (Robert Muir)
  • LUCENE-5722: Optimize ByteBufferIndexInput#seek() by specializing implementations. This improves random access as used by docvalues codecs if used with MMapDirectory.
  • (Robert Muir, Uwe Schindler)
  • LUCENE-5730: FSDirectory.open returns MMapDirectory for 64-bit operating systems, not just Linux and Windows.
  • (Robert Muir)
  • LUCENE-5703: BinaryDocValues producers don't allocate or copy bytes on each access anymore.
  • (Adrien Grand)
  • LUCENE-5721: Monotonic compression doesn't use zig-zag encoding anymore.
  • (Robert Muir, Adrien Grand)
  • LUCENE-5750: Speed up monotonic addressing for BINARY and SORTED_SET docvalues.
  • (Robert Muir)
  • LUCENE-5751: Speed up MemoryDocValues.
  • (Adrien Grand, Robert Muir)
  • LUCENE-5767: OrdinalMap optimizations, that mostly help on low cardinalities.
  • (Martijn van Groningen, Adrien Grand)
  • LUCENE-5769: SingletonSortedSetDocValues now supports random access ordinals.
  • (Robert Muir)
  • Bug fixes:
  • LUCENE-5738: Ensure NativeFSLock prevents opening the file channel for the lock if the lock is already obtained by the JVM. Trying to obtain an already obtained lock in the same JVM can unlock the file, which might allow other processes to lock the file even without explicitly unlocking the FileLock. This behavior is operating system dependent.
  • (Simon Willnauer)
  • LUCENE-5673: MMapDirectory: Work around a "bug" in the JDK that throws a confusing OutOfMemoryError wrapped inside IOException if the FileChannel mapping failed because of lack of virtual address space. The IOException is rethrown with more useful information about the problem, omitting the incorrect OutOfMemoryError.
  • (Robert Muir, Uwe Schindler)
  • LUCENE-5682: NPE in QueryRescorer when Scorer is null
  • (Joel Bernstein, Mike McCandless)
  • LUCENE-5691: DocTermOrds lookupTerm(BytesRef) would return incorrect results if the underlying TermsEnum supports ord() and the insertion point would be at the end.
  • (Robert Muir)
  • LUCENE-5618, LUCENE-5636: SegmentReader referenced unneeded files following doc-values updates. Now doc-values field updates are written in separate file per field.
  • (Shai Erera, Robert Muir)
  • LUCENE-5684: Make best effort to detect invalid usage of Lucene, when IndexReader is reopened after all files in its index were removed and recreated by the application (the proper way to do this is IndexWriter.deleteAll, or opening an IndexWriter with OpenMode.CREATE)
  • (Mike McCandless)
  • LUCENE-5704: Fix compilation error with Java 8u20.
  • (Uwe Schindler)
  • LUCENE-5710: Include the inner exception as the cause and in the exception message when an immense term is hit during indexing
  • (Lee Hinman via Mike McCandless)
  • LUCENE-5724: CompoundFileWriter was failing to pass through the IOContext in some cases, causing NRTCachingDirectory to cache compound files when it shouldn't, then causing OOMEs.
  • (Mike McCandless)
  • LUCENE-5747: Project-specific settings for the eclipse development environment will prevent automatic code reformatting.
  • (Shawn Heisey)
  • LUCENE-5768, LUCENE-5777: Hunspell condition checks containing character classes were buggy.
  • (Clinton Gormley, Robert Muir)
  • Test Framework:
  • LUCENE-5622: Fail tests if they print over the given limit of bytes to System.out or System.err.
  • (Robert Muir, Dawid Weiss)
  • LUCENE-5619: Added backwards compatibility tests to ensure we can update existing indexes with doc-values updates.
  • (Shai Erera, Robert Muir)
  • Build:
  • LUCENE-5442: The Ant check-lib-versions target now runs Ivy resolution transitively, then fails the build when it finds a version conflict: when a transitive dependency's version is more recent than the direct dependency's version specified in lucene/ivy-versions.properties. Exceptions are specifiable in lucene/ivy-ignore-conflicts.properties.
  • (Steve Rowe)
  • LUCENE-5715: Upgrade direct dependencies known to be older than transitive dependencies: com.sun.jersey.version:1.8->1.9; com.sun.xml.bind:jaxb-impl:2.2.2->2.2.3-1; commons-beanutils:commons-beanutils:1.7.0->1.8.3; commons-digester:commons-digester:2.0->2.1; commons-io:commons-io:2.1->2.3; commons-logging:commons-logging:1.1.1->1.1.3; io.netty:netty:3.6.2.Final->3.7.0.Final; javax.activation:activation:1.1->1.1.1; javax.mail:mail:1.4.1->1.4.3; log4j:log4j:1.2.16->1.2.17; org.apache.avro:avro:1.7.4->1.7.5; org.tukaani:xz:1.2->1.4; org.xerial.snappy:snappy-java:1.0.4.1->1.0.5
  • (Steve Rowe)
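
The LUCENE-5703 entries above note that BinaryDocValues no longer allocates or copies bytes on each access, so callers must clone anything they want to keep. The sketch below illustrates that general pattern with a hypothetical `ValueSource` that reuses a single buffer. It is not the Lucene API, and all names in it are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical value source that reuses one buffer across calls; not the Lucene API.
class ReusedBufferSketch {
    static final class ValueSource {
        private final byte[][] values;
        private final byte[] scratch = new byte[16]; // shared across calls
        int length; // number of valid bytes in scratch after the last get()

        ValueSource(byte[][] values) { this.values = values; }

        // Returns the shared buffer; its contents are overwritten by the next call.
        byte[] get(int doc) {
            length = values[doc].length;
            System.arraycopy(values[doc], 0, scratch, 0, length);
            return scratch;
        }
    }

    public static void main(String[] args) {
        ValueSource src = new ValueSource(new byte[][] {{1, 1}, {2, 2}});

        List<byte[]> kept = new ArrayList<>();
        List<byte[]> copied = new ArrayList<>();
        for (int doc = 0; doc < 2; doc++) {
            byte[] view = src.get(doc);
            kept.add(view);                              // wrong: retains the shared buffer
            copied.add(Arrays.copyOf(view, src.length)); // right: private copy of the valid bytes
        }

        System.out.println(kept.get(0)[0]);   // overwritten by the second get()
        System.out.println(copied.get(0)[0]); // still the first document's value
    }
}
```

The trade-off is fewer allocations on the hot path in exchange for an explicit copy at the few call sites that retain values.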

New in Apache Lucene 4.8.1 (May 20, 2014)

  • Bug fixes:
  • LUCENE-5639: Fix PositionLengthAttribute implementation in Token class.
  • LUCENE-5635: IndexWriter didn't properly handle IOException on TokenStream.reset(), which could leave the analyzer in an inconsistent state.
  • LUCENE-5599: HttpReplicator did not properly delegate bulk read() to wrapped InputStream.
  • LUCENE-5600: HttpClientBase did not properly consume a connection if a server error occurred.
  • LUCENE-5628: Change getFiniteStrings to iterative not recursive implementation, so that building suggesters on a long suggestion doesn't risk overflowing the stack; previously it consumed one Java stack frame per character in the expanded suggestion. If you are building a suggester this is a nasty trap.
  • LUCENE-5559: Add additional argument validation for CapitalizationFilter and CodepointCountFilter.
  • LUCENE-5641: SimpleRateLimiter would silently rate limit at 8 MB/sec even if you asked for higher rates.
  • LUCENE-5644: IndexWriter clears which threads use which internal thread states on flush, so that if an application reduces how many threads it uses for indexing, that results in a reduction of how many segments are flushed on a full-flush (e.g. to obtain a near-real-time reader).
  • LUCENE-5653: JoinUtil with ScoreMode.Avg on a multi-valued field with more than 256 values would throw exception.
  • LUCENE-5654: Fix various close() methods that could suppress throwables such as OutOfMemoryError, instead returning scary messages that look like index corruption.
  • LUCENE-5656: Fix rare fd leak in SegmentReader when multiple docvalues fields have been updated with IndexWriter.updateXXXDocValue and one hits exception.
  • LUCENE-5660: AnalyzingSuggester.build will now throw IllegalArgumentException if you give it a longer suggestion than it can handle
  • LUCENE-5662: Add missing checks to Field to prevent IndexWriter.abort if a stored value is null.
  • LUCENE-5668: Fix off-by-one in TieredMergePolicy
  • LUCENE-5671: Upgrade ICU version to fix an ICU concurrency problem that could cause exceptions when indexing.
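
The LUCENE-5628 entry above replaces a recursive traversal, which consumed one Java stack frame per character, with an iterative one. The following sketch shows the general technique with an explicit stack. It swaps in a simple trie for Lucene's automaton, so the `Node` and `Frame` classes and the method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical trie-based sketch; not Lucene's automaton classes.
class IterativeEnumerationSketch {
    static final class Node {
        final Map<Character, Node> next = new TreeMap<>();
        boolean accept;
    }

    // Adds one string to the trie, marking its last node as accepting.
    static void add(Node root, String s) {
        Node n = root;
        for (char c : s.toCharArray()) {
            n = n.next.computeIfAbsent(c, k -> new Node());
        }
        n.accept = true;
    }

    // Pending work item: a node plus the string spelled on the path to it.
    static final class Frame {
        final Node node;
        final String prefix;

        Frame(Node node, String prefix) {
            this.node = node;
            this.prefix = prefix;
        }
    }

    // Depth-first enumeration with an explicit stack, so the traversal depth
    // is limited by heap memory rather than by the Java call stack.
    static List<String> finiteStrings(Node root) {
        List<String> out = new ArrayList<>();
        Deque<Frame> stack = new ArrayDeque<>();
        stack.push(new Frame(root, ""));
        while (!stack.isEmpty()) {
            Frame f = stack.pop();
            if (f.node.accept) {
                out.add(f.prefix);
            }
            for (Map.Entry<Character, Node> e : f.node.next.entrySet()) {
                stack.push(new Frame(e.getValue(), f.prefix + e.getKey()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Node root = new Node();
        add(root, "a");
        add(root, "ab");
        add(root, "ac");
        System.out.println(finiteStrings(root)); // order follows the stack, not the alphabet
    }
}
```

Building a new prefix string per frame keeps the sketch short but costs quadratic time on very long inputs; a production version would share the path buffer instead.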

New in Apache Lucene 4.8.0 (Apr 28, 2014)

  • System Requirements:
  • LUCENE-4747, LUCENE-5514: Move to Java 7 as minimum Java version.
  • Changes in Runtime Behavior:
  • LUCENE-5472: IndexWriter.addDocument will now throw an IllegalArgumentException if a Term to be indexed exceeds IndexWriter.MAX_TERM_LENGTH. To recreate previous behavior of silently ignoring these terms, use LengthFilter in your Analyzer.
  • New Features:
  • LUCENE-5356: Morfologik filter can accept custom dictionary resources.
  • LUCENE-5454: Add SortedSetSortField to lucene/sandbox, to allow sorting on multi-valued field.
  • LUCENE-5478: CommonTermsQuery now allows creating custom term queries, similar to the query parser, by overriding a newTermQuery method.
  • LUCENE-5477: AnalyzingInfixSuggester now supports near-real-time additions and updates (to change weight or payload of an existing suggestion).
  • LUCENE-5482: Improve default TurkishAnalyzer by adding apostrophe handling suitable for Turkish.
  • LUCENE-5479: FacetsConfig subclass can now customize the default per-dim facets configuration.
  • LUCENE-5485: Add circumfix support to HunspellStemFilter.
  • LUCENE-5224: Add iconv, oconv, and ignore support to HunspellStemFilter.
  • LUCENE-5493: SortingMergePolicy, and EarlyTerminatingSortingCollector support arbitrary Sort specifications.
  • LUCENE-3758: Allow the ComplexPhraseQueryParser to search ordered or unordered proximity queries.
  • LUCENE-5530: ComplexPhraseQueryParser throws ParseException for fielded queries.
  • LUCENE-5513: Add IndexWriter.updateBinaryDocValue which lets you update the value of a BinaryDocValuesField without reindexing the document(s).
  • LUCENE-4072: Add ICUNormalizer2CharFilter, which lets you do unicode normalization with offset correction before the tokenizer.
  • LUCENE-5476: Add RandomSamplingFacetsCollector for computing facets on a sampled set of matching hits, in cases where there are millions of hits.
  • LUCENE-4984: Add SegmentingTokenizerBase, abstract class for tokenizers that want to do two-pass tokenization such as by sentence and then by word.
  • LUCENE-5489: Add Rescorer/QueryRescorer, to resort the hits from a first pass search using scores from a more costly second pass search.
  • LUCENE-5528: Add context to suggesters (InputIterator and Lookup classes), and fix AnalyzingInfixSuggester to handle contexts. Suggester contexts allow you to filter suggestions.
  • LUCENE-5545: Add SortRescorer and Expression.getRescorer, to resort the hits from a first pass search using a Sort or an Expression.
  • LUCENE-5558: Add TruncateTokenFilter which truncates terms to the specified length.
  • LUCENE-2446: Added checksums to Lucene index files. As of 4.8, the last 8 bytes of each file contain a zlib-crc32 checksum. Small metadata files are verified on load. Larger files can be checked on demand via AtomicReader.checkIntegrity. You can configure this to happen automatically before merges by enabling IndexWriterConfig.setCheckIntegrityAtMerge.
  • LUCENE-5580: Checksums are automatically verified on the default stored fields format when performing a bulk merge.
  • LUCENE-5602: Checksums are automatically verified on the default term vectors format when performing a bulk merge.
  • LUCENE-5583: Added DataInput.skipBytes. ChecksumIndexInput can now seek, but only forward.
  • LUCENE-5588: Lucene now calls fsync() on the index directory, ensuring that all file metadata is persisted on disk in case of power failure. This does not work on all file systems and operating systems, but Linux and MacOSX are known to work. On Windows, fsyncing a directory is not possible with Java APIs.
  • API Changes:
  • LUCENE-5454: Add RandomAccessOrds, an optional extension of SortedSetDocValues that supports random access to the ordinals in a document.
  • LUCENE-5468: Move offline Sort (from suggest module) to OfflineSort.
  • LUCENE-5493: SortingMergePolicy and EarlyTerminatingSortingCollector take Sort instead of Sorter. BlockJoinSorter is removed, replaced with BlockJoinComparatorSource, which can take a Sort for ordering of parents and a separate Sort for ordering of children within a block.
  • LUCENE-5516: MergeScheduler#merge() now accepts a MergeTrigger as well as a boolean that indicates if a new merge was found in the caller thread before the scheduler was called.
  • LUCENE-5487: Separated bulk scorer (new Weight.bulkScorer method) from normal scoring (Weight.scorer) for those queries that can do bulk scoring more efficiently, e.g. BooleanQuery in some cases. This also simplified the Weight.scorer API by removing the two confusing booleans.
  • LUCENE-5519: TopNSearcher now allows retrieving incomplete results if the max size of the candidate queue is unknown. The queue can still be bounded in order to apply pruning while retrieving the top N, but it will not throw an exception if too many results are rejected to guarantee an absolutely correct top N result. The TopNSearcher now returns a struct-like class that indicates whether the result is complete in the sense of the top N. Consumers of this API should assert on the completeness if the bounded queue size is known ahead of time.
  • LUCENE-4984: Deprecate ThaiWordFilter and smartcn SentenceTokenizer and WordTokenFilter. These filters would not work correctly with CharFilters and could not be safely placed at an arbitrary position in the analysis chain. Use ThaiTokenizer and HMMChineseTokenizer instead.
  • LUCENE-5543: Remove/deprecate Directory.fileExists
  • LUCENE-5573: Move docvalues constants and helper methods to o.a.l.index.DocValues.
  • LUCENE-5604: Switched BytesRef.hashCode to MurmurHash3 (32 bit). TermToBytesRefAttribute.fillBytesRef no longer returns the hash code. BytesRefHash now uses MurmurHash3 for its hashing.
  • Optimizations:
  • LUCENE-5468: HunspellStemFilter uses 10 to 100x less RAM. It also loads all known openoffice dictionaries without error, and supports an additional longestOnly option for a less aggressive approach.
  • LUCENE-4848: Use Java 7 NIO2-FileChannel instead of RandomAccessFile for NIOFSDirectory and MMapDirectory. This allows deleting open files on Windows if NIOFSDirectory is used; mmapped files are still locked.
  • LUCENE-5515: Improved TopDocs#merge to create a merged ScoreDoc array whose length is at most the specified size, instead of at most from + size as before.
  • LUCENE-5529: Spatial search of non-point indexed shapes should be a little faster due to skipping intersection tests on redundant cells.
  • Bug fixes:
  • LUCENE-5483: Fix inaccuracies in HunspellStemFilter. Multi-stage affix-stripping, prefix-suffix dependencies, and COMPLEXPREFIXES now work correctly according to the hunspell algorithm. Removed the recursionCap parameter, as it's no longer needed: rules for recursive affix application are driven correctly by continuation classes in the affix file.
  • LUCENE-5497: HunspellStemFilter properly handles escaped terms and affixes without conditions.
  • LUCENE-5505: HunspellStemFilter ignores BOM markers in dictionaries and handles varying types of whitespace in SET/FLAG commands.
  • LUCENE-5507: Fix HunspellStemFilter loading of dictionaries with a large number of aliases etc. before the encoding declaration.
  • LUCENE-5111: Fix WordDelimiterFilter to return offsets in correct order.
  • LUCENE-5555: Fix SortedInputIterator to correctly encode/decode contexts in presence of payload
  • LUCENE-5559: Add missing argument checks to tokenfilters taking numeric arguments.
  • LUCENE-5568: Benchmark module's "default.codec" option didn't work.
  • SOLR-5983: HTMLStripCharFilter is treating CDATA sections incorrectly.
  • LUCENE-5615: Validate per-segment delete counts at write time, to help catch bugs that might otherwise cause corruption
  • LUCENE-5612: NativeFSLockFactory no longer deletes its lock file. This cannot be done safely without the risk of deleting someone else's lock file. If you use NativeFSLockFactory, you may see write.lock hanging around from time to time: it's harmless.
  • LUCENE-5624: Ensure NativeFSLockFactory does not leak file handles if it is unable to obtain the lock.
  • LUCENE-5626: Fix bug in SimpleFSLockFactory's obtain() that sometimes threw IOException (ERROR_ACCESS_DENIED) on Windows if the lock file was created concurrently. This error is now handled the same way as in NativeFSLockFactory, by returning false.
  • LUCENE-5630: Add missing META-INF entry for UpperCaseFilterFactory.
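
As an illustration of the hashing change in LUCENE-5604 above, the following is a minimal, self-contained sketch of the public MurmurHash3 x86 32-bit algorithm. The class name Murmur3Sketch and the seed passed in main are assumptions made for this example; it is not Lucene's own code, and the seed Lucene actually uses is not stated in this changelog.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only (not part of Lucene).
final class Murmur3Sketch {

    /** MurmurHash3 x86 32-bit over data[offset, offset + len). */
    static int hash32(byte[] data, int offset, int len, int seed) {
        final int c1 = 0xcc9e2d51;
        final int c2 = 0x1b873593;
        int h1 = seed;
        int roundedEnd = offset + (len & 0xfffffffc); // end of the last full 4-byte block

        // Body: mix one little-endian 32-bit block at a time.
        for (int i = offset; i < roundedEnd; i += 4) {
            int k1 = (data[i] & 0xff)
                    | ((data[i + 1] & 0xff) << 8)
                    | ((data[i + 2] & 0xff) << 16)
                    | (data[i + 3] << 24);
            k1 *= c1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= c2;
            h1 ^= k1;
            h1 = Integer.rotateLeft(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }

        // Tail: mix the remaining 1 to 3 bytes, if any.
        int k1 = 0;
        switch (len & 3) {
            case 3:
                k1 = (data[roundedEnd + 2] & 0xff) << 16;
                // fall through
            case 2:
                k1 |= (data[roundedEnd + 1] & 0xff) << 8;
                // fall through
            case 1:
                k1 |= (data[roundedEnd] & 0xff);
                k1 *= c1;
                k1 = Integer.rotateLeft(k1, 15);
                k1 *= c2;
                h1 ^= k1;
                break;
            default:
                break;
        }

        // Finalization: force all bits of the hash to avalanche.
        h1 ^= len;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    public static void main(String[] args) {
        byte[] term = "lucene".getBytes(StandardCharsets.UTF_8);
        // Seed 0 is an arbitrary choice for this demo.
        System.out.println(Integer.toHexString(hash32(term, 0, term.length, 0)));
    }
}
```

Code that relied on the previous BytesRef.hashCode values (for example, by persisting them) would see different values after this change, which is why the entry is listed under API changes.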

New in Apache Lucene 4.7.2 (Apr 15, 2014)

  • Bug Fixes:
  • LUCENE-5574: Closing a near-real-time reader no longer attempts to delete unreferenced files if the original writer has been closed; this could cause index corruption in certain cases where index files were directly changed (deleted, overwritten, etc.) in the index directory outside of Lucene.
  • LUCENE-5570: Don't let FSDirectory.sync() create new zero-byte files, instead throw exception if a file is missing.

New in Apache Lucene 4.7.1 (Apr 2, 2014)

  • Changes in Runtime Behavior:
  • LUCENE-5532: AutomatonQuery.equals is no longer implemented as "accepts same language". This was inconsistent with hashCode, and unnecessary for any subclasses in Lucene. If you desire this in a custom subclass, minimize the automaton.
  • Bug Fixes:
  • LUCENE-5450: Fix getField() NPE issues with SpanOr/SpanNear when they have an empty list of clauses. This can happen, for example, when a wildcard matches no terms.
  • LUCENE-5473: Throw IllegalArgumentException, not NullPointerException, if the synonym map is empty when creating SynonymFilter
  • LUCENE-5432: EliasFanoDocIdSet: Fix number of index entry bits when the maximum entry is a power of 2.
  • LUCENE-5466: query is always null in countDocsWithClass() of SimpleNaiveBayesClassifier.
  • LUCENE-5502: Fixed TermsFilter.equals that could return true for different filters.
  • LUCENE-5522: FacetsConfig didn't add drill-down terms for association facet fields labels.
  • LUCENE-5520: ToChildBlockJoinQuery would hit ArrayIndexOutOfBoundsException if a parent document had no children
  • LUCENE-5532: AutomatonQuery.hashCode was not thread-safe.
  • LUCENE-5525: Implement MultiFacets.getAllDims, so you can do sparse facets through DrillSideways, for example.
  • LUCENE-5481: IndexWriter.forceMerge used to run a merge even if there was a single segment in the index.
  • LUCENE-5538: Fix FastVectorHighlighter bug with index-time synonyms when the query is more complex than a single phrase.
  • LUCENE-5544: Exceptions during IndexWriter.rollback could leak file handles and the write lock.
  • LUCENE-4978: Spatial RecursivePrefixTree queries could result in false-negatives for indexed shapes within 1/2 maxDistErr from the edge of the query shape. This meant searching for a point by the same point as a query rarely worked.
  • LUCENE-5553: IndexReader#ReaderClosedListener is not always invoked when IndexReader#close() is called or if refCount is 0. If an exception is thrown during internal close or by any of the close listeners, some or all listeners might be missed. This can cause memory leaks if the core listeners are used to clear caches.
  • Build:
  • LUCENE-5511: "ant precommit" / "ant check-svn-working-copy" now work again with any working copy format (thanks to svnkit 1.8.4).

New in Apache Lucene 4.7.0 (Feb 27, 2014)

  • New Features:
  • Add SimpleQueryParser: parser for human-entered queries.
  • Add Payload support to FileDictionary (Suggest) and make it more configurable
  • Add .getCount method to all suggesters (Lookup); persist count metadata on .store(); Dictionary returns InputIterator; Dictionary.getWordIterator renamed to .getEntryIterator.
  • The RangeMapFloatFunction accepts an arbitrary ValueSource as target and default values.
  • Speed up Lucene range faceting from O(N) per hit to O(log(N)) per hit using segment trees; this only really starts to matter in practice if the number of ranges is over 10 or so.
  • Add Analyzer for Kurdish.
  • Added an UpperCaseFilter to make UPPERCASE tokens.
  • Add a new BlendedInfixSuggester, which is like AnalyzingInfixSuggester but boosts suggestions that matched tokens with lower positions.
  • When sorting by String (SortField.STRING), you can now specify whether missing values should be sorted first (the default), using SortField.setMissingValue(SortField.STRING_FIRST), or last, using SortField.setMissingValue(SortField.STRING_LAST).
  • QueryNode should have the ability to detach from its node parent. Added QueryNode.removeFromParent(), which allows a node to be detached from its parent node.
  • Upgrade to Spatial4j 0.4.1
  • Add multitermquery (wildcards,prefix,etc) to PostingsHighlighter.
  • Add two memory resident dictionaries (FST terms dictionary and FSTOrd terms dictionary) to improve primary key lookups. The PostingsBaseFormat API is also changed so that term dictionaries get the ability to block encode term metadata, and all dictionary implementations can now plug in any PostingsBaseFormat.
  • ShingleFilter's filler token should be configurable.
  • Add SearcherTaxonomyManager over search and taxonomy index directories (i.e. not only NRT).
  • Add fuzzy and near support via '~' operator to SimpleQueryParser.
  • Make SortedSetDocValuesReaderState abstract to allow custom implementations for Lucene doc values faceting
  • NRT support for file systems that do not have delete on last close or cannot delete while referenced semantics.
  • Drilling down or sideways on a Lucene facet range (using Range.getFilter()) is now faster for costly filters (uses random access, not iteration); range facet counts now accept a fast-match filter to avoid computing the value for documents that are out of bounds, e.g. using a bounding box filter with distance range faceting.
  • Add LongBitSet for managing more than 2.1B bits (otherwise use FixedBitSet).
  • ASCIIFoldingFilter now has an option to preserve the original token and emit it on the same position as the folded token only if the actual token was folded.
  • Add spatial SerializedDVStrategy that serializes a binary representation of a shape into BinaryDocValues. It supports exact geometry relationship calculations.
  • Add SloppyMath.earthDiameter(double latitude) that returns an approximate value of the diameter of the earth at the given latitude.
  • Bug Fixes:
  • Improved highlighting of multi-valued fields with FastVectorHighlighter.
  • UAX29URLEmailTokenizer should not tokenize no-scheme domain-only URLs that are followed by an alphanumeric character.
  • If an analysis component throws an exception, Lucene logs the field name to the info stream to assist in diagnosis.
  • PriorityQueue now refuses to allocate itself if the incoming maxSize is too large
  • IndexWriter.addIndexes(Directory[]) now acquires a write lock in each Directory, to ensure that no open IndexWriter is changing the incoming indices. This also means that you cannot pass the same Directory to multiple concurrent addIndexes calls (which is unusual anyway).
  • SpanMultiTermQueryWrapper didn't handle its boost in hashcode/equals/tostring/rewrite.
  • ToParentBlockJoinCollector.getTopGroups would fail to return any groups when the joined query required more than one rewrite step
  • NormValueSource was incorrectly casting the long value to byte, before calling Similarity.decodeNormValue.
  • ReferenceManager#acquire can result in an infinite loop if the managed resource is abused outside of the ReferenceManager. Decrementing the reference without a corresponding incRef() call can cause an infinite loop. ReferenceManager now throws IllegalStateException if the currently managed resource's ref count is 0.
  • Lucene45DocValuesProducer.ramBytesUsed() may throw ConcurrentModificationException.
  • MemoryIndex didn't respect the analyzer's offset gap, and offsets were corrupted if multiple fields with the same name were added to the memory index.
  • StandardTokenizer should break at consecutive chars matching Word_Break = MidLetter, MidNum and/or MidNumLet
  • RamUsageEstimator.sizeOf(Object) is not used anymore to estimate memory usage of segments. This used to make SegmentReader.ramBytesUsed very CPU-intensive.
  • ControlledRealTimeReopenThread would sometimes wait too long (up to targetMaxStaleSec) when a searcher is waiting for a specific generation, when it should have waited for at most targetMinStaleSec.
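
To illustrate the LongBitSet entry above, here is a minimal sketch of a bit set addressed by long indexes, which is what lets the bit count exceed the int range (about 2.1B). The class LongIndexedBitSetSketch and its methods are hypothetical names for this example and are not Lucene's implementation; the demo uses a small size only to keep memory use low.

```java
// Hypothetical class for illustration only (not Lucene's LongBitSet).
final class LongIndexedBitSetSketch {
    private final long[] words; // 64 bits per word
    private final long numBits;

    LongIndexedBitSetSketch(long numBits) {
        if (numBits < 0) {
            throw new IllegalArgumentException("numBits must be >= 0");
        }
        long numWords = (numBits + 63) >>> 6; // round up to whole words
        if (numWords > Integer.MAX_VALUE - 8) {
            throw new IllegalArgumentException("too many bits for a single long[]");
        }
        this.numBits = numBits;
        this.words = new long[(int) numWords];
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= numBits) {
            throw new IndexOutOfBoundsException("index=" + index + " numBits=" + numBits);
        }
    }

    void set(long index) {
        checkIndex(index);
        // Java uses only the low 6 bits of the shift distance for long shifts.
        words[(int) (index >>> 6)] |= 1L << index;
    }

    boolean get(long index) {
        checkIndex(index);
        return (words[(int) (index >>> 6)] & (1L << index)) != 0;
    }

    long cardinality() {
        long count = 0;
        for (long w : words) {
            count += Long.bitCount(w);
        }
        return count;
    }

    public static void main(String[] args) {
        LongIndexedBitSetSketch bits = new LongIndexedBitSetSketch(1_000L);
        bits.set(5L);
        bits.set(999L);
        System.out.println(bits.get(5L) + " " + bits.get(6L) + " " + bits.cardinality());
    }
}
```

Because each word holds 64 bits, a single long[] can address far more bits than an int index can express; for sets that fit within the int range, the entry above recommends FixedBitSet instead.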

New in Apache Lucene 4.6.1 (Jan 28, 2014)

  • Bug fixes:
  • LUCENE-5373: Memory usage of [Lucene40/Lucene42/Memory/Direct]DocValuesFormat was over-estimated.
  • LUCENE-5361: Fixed handling of query boosts in FastVectorHighlighter.
  • LUCENE-5374: IndexWriter processes internal events after it has closed itself internally. This rare condition can happen if an IndexWriter has internal changes that were not fully applied yet, such as when index / flush requests happen concurrently with the close or rollback call.
  • LUCENE-5394: Fix TokenSources.getTokenStream to return payloads if they were indexed with the term vectors.
  • LUCENE-5344: Flexible StandardQueryParser behaves differently than ClassicQueryParser.
  • LUCENE-5375: ToChildBlockJoinQuery works harder to detect mis-use, when the parent query incorrectly returns child documents, and throws a clear exception saying so.
  • LUCENE-5401: Field.StringTokenStream#end() calls super.end() now, preventing wrong term positions for fields that use StringTokenStream.
  • LUCENE-5377: IndexWriter.addIndexes(Directory[]) would cause corruption on Lucene 4.6 if any index segments were Lucene 4.0-4.5.

New in Apache Lucene 4.6.0 (Nov 22, 2013)

  • New Features:
  • LUCENE-4906: PostingsHighlighter can now render to custom Object, for advanced use cases where String is too restrictive
  • LUCENE-5133: Changed AnalyzingInfixSuggester.highlight to return Object instead of String, to allow for advanced use cases where String is too restrictive
  • LUCENE-5207, LUCENE-5334: Added expressions module for customizing ranking with script-like syntax.
  • LUCENE-5180: ShingleFilter now creates shingles with trailing holes, for example if a StopFilter had removed the last token.
  • LUCENE-5219: Add support to SynonymFilterFactory for custom parsers.
  • LUCENE-5235: Tokenizers now throw an IllegalStateException if the consumer does not call reset() before consuming the stream. Previous versions threw NullPointerException or ArrayIndexOutOfBoundsException on a best-effort basis, which was not user-friendly.
  • LUCENE-5240: Tokenizers now throw an IllegalStateException if the consumer neglects to call close() on the previous stream before consuming the next one.
  • LUCENE-5214: Add new FreeTextSuggester, to predict the next word using a simple ngram language model. This is useful for the "long tail" suggestions, when a primary suggester fails to find a suggestion.
  • LUCENE-5251: New DocumentDictionary allows building suggesters via contents of existing field, weight and optionally payload stored fields in an index
  • LUCENE-5261: Add QueryBuilder, a simple API to build queries from the analysis chain directly, or to make it easier to implement query parsers.
  • LUCENE-5270: Add Terms.hasFreqs, to determine whether a given field indexed per-doc term frequencies.
  • LUCENE-5269: Add CodepointCountFilter.
  • LUCENE-5294: Suggest module: add DocumentExpressionDictionary to compute each suggestion's weight using a javascript expression.
  • LUCENE-5274: FastVectorHighlighter now supports highlighting against several indexed fields.
  • LUCENE-5304: SingletonSortedSetDocValues can now return the wrapped SortedDocValues
  • LUCENE-2844: The benchmark module can now test the spatial module. See spatial.alg
  • LUCENE-5302: Make StemmerOverrideMap's methods public
  • LUCENE-5296: Add DirectDocValuesFormat, which holds all doc values in heap as uncompressed java native arrays.
  • LUCENE-5189: Add IndexWriter.updateNumericDocValues, to update numeric DocValues fields of documents, without re-indexing them.
  • LUCENE-5298: Add SumValueSourceFacetRequest for aggregating facets by a ValueSource, such as a NumericDocValuesField or an expression.
  • LUCENE-5323: Add .sizeInBytes method to all suggesters (Lookup).
  • LUCENE-5312: Add BlockJoinSorter, a new Sorter implementation that makes sure to never split up blocks of documents indexed with IndexWriter.addDocuments.
  • LUCENE-5297: Allow to range-facet on any ValueSource, not just NumericDocValues fields.
  • Bug Fixes:
  • LUCENE-5272: OpenBitSet.ensureCapacity did not modify numBits, causing false assertion errors in fastSet.
  • LUCENE-5303: OrdinalsCache did not use coreCacheKey, resulting in over caching across multiple threads.
  • LUCENE-5307: Fix topScorer inconsistency in handling QueryWrapperFilter inside ConstantScoreQuery, which now rewrites to a query removing the obsolete QueryWrapperFilter.
  • LUCENE-5330: IndexWriter didn't process all internal events on #getReader(), #close() and #rollback() which causes files to be deleted at a later point in time. This could cause short-term disk pollution or OOM if in-memory directories are used.
  • LUCENE-5342: Fixed bulk-merge issue in CompressingStoredFieldsFormat which created corrupted segments when mixing chunk sizes. Lucene41StoredFieldsFormat is not impacted.
  • API Changes:
  • LUCENE-5222: Add SortField.needsScores(). Previously it was not possible for a custom Sort that makes use of the relevance score to work correctly with IndexSearcher when an ExecutorService is specified.
  • LUCENE-5275: Change AttributeSource.toString() to display the current state of attributes.
  • LUCENE-5277: Modify FixedBitSet copy constructor to take an additional numBits parameter to allow growing/shrinking the copied bitset. You can use FixedBitSet.clone() if you only need to clone the bitset.
  • LUCENE-5260: Use TermFreqPayloadIterator for all suggesters; those suggesters that can't support payloads will throw an exception if hasPayloads() is true.
  • LUCENE-5280: Rename TermFreqPayloadIterator -> InputIterator, along with associated suggest/spell classes.
  • LUCENE-5157: Rename OrdinalMap methods to clarify API and internal structure.
  • LUCENE-5313: Move preservePositionIncrements from setter to ctor in Analyzing/FuzzySuggester.
  • LUCENE-5321: Remove Facet42DocValuesFormat. Use DirectDocValuesFormat if you want to load the category list into memory.
  • LUCENE-5324: AnalyzerWrapper.getPositionIncrementGap and getOffsetGap can now be overridden.
  • Optimizations:
  • LUCENE-5225: The ToParentBlockJoinQuery only keeps track of the child doc ids and child scores if the ToParentBlockJoinCollector is used.
  • LUCENE-5236: EliasFanoDocIdSet now has an index and uses broadword bit selection to speed-up advance().
  • LUCENE-5266: Improved number of read calls and branches in DirectPackedReader.
  • LUCENE-5300: Optimized SORTED_SET storage for fields which are single-valued.
  • Documentation:
  • LUCENE-5211: Better javadocs and error checking of 'format' option in StopFilterFactory, as well as comments in all snowball formatted files about specifying format option.

New in Apache Lucene 4.5.1 (Oct 25, 2013)

  • Bug Fixes:
  • LUCENE-4998: Fixed a few places to pass IOContext.READONCE instead of IOContext.READ
  • LUCENE-5242: DirectoryTaxonomyWriter.replaceTaxonomy did not fully reset its state, which could result in exceptions being thrown, as well as incorrect ordinals returned from getParent.
  • LUCENE-5254: Fixed bounded memory leak, where objects like live docs bitset were not freed from a starting reader after reopening to a new reader and closing the original one.
  • LUCENE-5262: Fixed file handle leaks when multiple attempts to open an NRT reader hit exceptions.
  • LUCENE-5263: Transient IOExceptions, e.g. due to disk full or file descriptor exhaustion, hit at unlucky times inside IndexWriter could lead to silently losing deletions.
  • LUCENE-5264: CommonTermsQuery ignored minMustMatch if only high-frequent terms were present in the query and the high-frequent operator was set to SHOULD.
  • LUCENE-5269: Fix bug in NGramTokenFilter where it would sometimes count unicode characters incorrectly.
  • LUCENE-5289: IndexWriter.hasUncommittedChanges was returning false when there were buffered delete-by-Term.
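
The NGramTokenFilter fix above (LUCENE-5269) concerns counting Unicode characters. As a general illustration of that pitfall, and not a reproduction of the Lucene fix itself, the sketch below builds n-grams over code points rather than Java's UTF-16 char units, so a supplementary character such as an emoji counts as one character. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only (not Lucene's NGramTokenFilter).
final class CodePointNGramSketch {

    /** Returns all n-grams of s, where n counts Unicode code points, not UTF-16 units. */
    static List<String> nGrams(String s, int n) {
        int[] codePoints = s.codePoints().toArray();
        List<String> grams = new ArrayList<>();
        for (int start = 0; start + n <= codePoints.length; start++) {
            // String(int[], int, int) re-encodes the selected code points as UTF-16.
            grams.add(new String(codePoints, start, n));
        }
        return grams;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', one supplementary code point (U+1F600), 'b'
        System.out.println("UTF-16 units: " + s.length());
        System.out.println("code points:  " + s.codePointCount(0, s.length()));
        System.out.println("bigrams:      " + nGrams(s, 2));
    }
}
```

Slicing the same string by char index instead would split the surrogate pair and produce grams containing unpaired surrogates.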

New in Apache Lucene 4.5.0 (Oct 7, 2013)

  • New Features:
  • Added new Elias-Fano encoder, decoder and DocIdSet implementations.
  • Added WAHDocIdSet, an in-memory doc id set implementation based on word-aligned hybrid encoding.
  • New broadword utility methods in oal.util.BroadWord.
  • FuzzySuggester now supports optional unicodeAware (default is false). If true then edits are measured in Unicode code points instead of UTF-8 bytes.
  • SpatialStrategy.makeDistanceValueSource() now has an optional multiplier for scaling degrees to another unit.
  • SpanNotQuery can now be configured with pre and post slop to act as a hypothetical SpanNotNearQuery.
  • FacetsAccumulator.create() is now able to create a MultiFacetsAccumulator over a mixed set of facet requests. MultiFacetsAccumulator allows wrapping multiple FacetsAccumulators, allowing to easily mix existing and custom ones. TaxonomyFacetsAccumulator supports any FacetRequest which implements createFacetsAggregator and was indexed using the taxonomy index.
  • AnalyzerWrapper.wrapReader allows wrapping the Reader given to inputReader.
  • FacetRequest.getValueOf and .getFacetArraysSource replaced by FacetsAggregator.createOrdinalValueResolver. This gives better options for resolving an ordinal's value by FacetAggregators.
  • Add SuggestStopFilter, to be used with analyzing suggesters, so that a stop word at the very end of the lookup query, and without any trailing token characters, will be preserved. This enables query "a" to suggest apple; see http://blog.mikemccandless.com///suggeststopfilter-carefully-removes.html for details.
  • Added support for missing values to DocValues fields. AtomicReader.getDocsWithField returns a Bits of documents with a value, and FieldCache.getDocsWithField forwards to that for DocValues fields. Things like SortField.setMissingValue, FunctionValues.exists, and FieldValueFilter now work with DocValues fields.
  • Lucene 4.5 has a new Lucene45Codec with Lucene45DocValues, supporting missing values and with most data structures residing off-heap. Added "Memory" docvalues format that works entirely in heap, and "Disk", which loads no data structures into RAM. Both of these also support missing values. Added DiskNormsFormat (in case you want norms entirely on disk).
  • Added PForDeltaDocIdSet, an in-memory doc id set implementation based on the PFOR encoding.
  • Added CachingWrapperFilter.getFilter in order to be able to get the wrapped filter.
  • Added SegmentReader.ramBytesUsed to return approximate heap RAM used by index datastructures.
  • Bug Fixes:
  • IndexWriter.addIndexes(IndexReader...) should drop empty (or all deleted) segments.
  • Spatial RecursivePrefixTree Contains predicate will throw an NPE when there's no indexed data and maybe in other circumstances too.
  • AnalyzingSuggester's sort comparator read part of the input key as the weight. As a result, the sorter never sorted by weight first: the weight is only considered when the inputs are equal, in which case the malformed weights were identical as well.
  • Associations FacetsAggregators could enter an infinite loop when some result documents were missing category associations.
  • Fix MemoryPostingsFormat to not modify borrowed BytesRef from FSTEnum seek/lookup, which can cause side effects if done on a cached FST root arc.
  • Handle the case where reading from a file or FileChannel returns -1, which could happen in rare cases where something happens to the file between the time we start the read loop (where we check the length) and when we actually do the read.
  • PostingsHighlighter would throw IOOBE if a term spanned the maxLength boundary, made it into the top-N and went to the formatter.
  • Indexing core no longer enforces a limit on maximum length binary doc values fields, but individual codecs (including the default one) have their own limits
  • TokenStreams now set the position increment in end(), so we can handle trailing holes. If you have a custom TokenStream implementing end() then be sure it calls super.end().
  • IndexWriter could allow adding same field name with different DocValueTypes under some circumstances.
  • SimpleHTMLEncoder in Highlighter module broke Unicode outside BMP because it encoded UTF-16 chars instead of codepoints. The escaping of codepoints > 127 was removed (not needed for valid HTML) and missing escaping for ' and / was added.
  • Fixed compression bug in LZ4.compressHC when the input is highly compressible and the start offset of the array to compress is > 0.
  • SimilarityBase did not write norms the same way as DefaultSimilarity if discountOverlaps == false and index-time boosts are present for the field.
  • Fixed IndexUpgrader command line parsing: -verbose is not required and -dir-impl option now works correctly.
  • Fix MultiTermQuery's constant score rewrites to always return a ConstantScoreQuery to make scoring consistent. Previously it returned an empty unwrapped BooleanQuery, if no terms were available, which has a different query norm.
  • In some cases, trying to retrieve or merge a 0-length binary doc value would hit an ArrayIndexOutOfBoundsException.
  • API Changes:
  • Add ramBytesUsed() to MultiDocValues.OrdinalMap.
  • Remove unused boolean useCache parameter from TermsEnum.seekCeil and .seekExact
  • IndexSearcher.searchAfter throws IllegalArgumentException if searchAfter exceeds the number of documents in the reader.
  • CategoryAssociationsContainer no longer supports null association values for categories. If you want to index categories without associations, you should add them using FacetFields.
  • IndexWriter no longer clones the given IndexWriterConfig. If you need to use the same config more than once, e.g. when sharing between multiple writers, make sure to clone it before passing to each writer.
  • StandardFacetsAccumulator renamed to OldFacetsAccumulator, and all associated classes were moved under o.a.l.facet.old. The intention is to remove it one day, when the features it covers (complements, partitions, sampling) will be migrated to the new FacetsAggregator and FacetsAccumulator API. Also, FacetRequest.createAggregator was replaced by OldFacetsAccumulator.createAggregator.
  • CommonTermsQuery now allows setting the minimum number of terms that should match for its high and low frequent sub-queries. Previously this was only supported on the low frequent terms query.
  • CompressingTermVectors TermsEnum no longer supports ord().
  • Fix default chunk sizes in FSDirectory to not be unnecessarily large; also use chunking when writing to index files. FSDirectory#setReadChunkSize() is now deprecated and will be removed in a future Lucene release.
  • Analyzer.ReuseStrategy instances are now stateless and can be reused in other Analyzer instances, which was not possible before. Lucene ships now with stateless singletons for per field and global reuse. Legacy code can still instantiate the deprecated implementation classes, but new code should use the constants. Implementors of custom strategies have to take care of new method signatures. AnalyzerWrapper can now be configured to use a custom strategy, too, ideally the one from the wrapped Analyzer. Analyzer adds a getter to retrieve the strategy for this use-case.
  • Lucene never writes segments with 0 documents anymore.
  • SortedDocValues always returns -1 ord when a document is missing a value for the field. Previously it only did this if the SortedDocValues was produced by uninversion on the FieldCache.
  • Remove BinaryDocValues.MISSING. In order to determine whether a document is missing a field, use getDocsWithField instead.
  • Changes in Runtime Behavior:
  • DocValues codec consumer APIs (iterables) return null values when the document has no value for the field.
  • The HighFreqTerms command-line tool returns the true top-N by totalTermFreq when using the -t option. It uses the term statistics (faster) and now always shows totalTermFreq in the output.
  • Optimizations:
  • Added TermFilter to filter docs by a specific term.
  • DiskDV keeps the document-to-ordinal mapping on disk for SortedDocValues.
  • New AppendingPackedLongBuffer, a new variant of the former AppendingLongBuffer which assumes values are 0-based.
  • All Appending*Buffer now support bulk get.
  • Fixed a performance regression of span queries caused by LUCENE-.
  • Make WAHDocIdSet able to inverse its encoding in order to compress dense sets efficiently as well.
  • Prefix-code the sorted/sortedset value dictionaries in DiskDV.
  • Fixed several wrapper analyzers to inherit the reuse strategy of the wrapped Analyzer.
  • Simplified DocumentsWriter and DocumentsWriterPerThread synchronization and concurrent interaction with IndexWriter. DWPT is now only set up once and has no reset logic. All segment publishing and state transition from DWPT into IndexWriter is now done via an event queue processed from within the IndexWriter, in order to prevent situations where DWPT or DW calling into IW causes deadlocks.
  • Terminate phrase searches early if max phrase window is exceeded in FastVectorHighlighter to prevent very long running phrase extraction if phrase terms are high frequent.
  • CompressingStoredFieldsFormat now slices chunks containing big documents into fixed-size blocks so that requesting a single field does not necessarily force decompression of the whole chunk.
  • CachingWrapper makes it easier to plug-in a custom cacheable DocIdSet implementation and uses WAHDocIdSet by default, which should be more memory efficient than FixedBitSet on average as well as faster on small sets.
  • Documentation:
  • Remove facet userguide as it was outdated. Partially absorbed into package's documentation and classes javadocs.
  • Clarify FuzzyQuery's unexpected behavior on short terms.
  • Changes in backwards compatibility policy:
  • CheckIndex.fixIndex(Status,Codec) is now CheckIndex.fixIndex(Status). If you used to pass a codec to this method, just remove it from the arguments.
  • Update Morfologik to a newer version. MorfologikAnalyzer and MorfologikFilter no longer support multiple "dictionaries" as there is only one dictionary available.
  • Changed method signatures of Analyzer.ReuseStrategy to take Analyzer. Closeable interface was removed because the class was changed to be stateless.
  • SlowCompositeReaderWrapper constructor is now private, SlowCompositeReaderWrapper.wrap should be used instead.
  • CachingWrapperFilter doesn't always return FixedBitSet instances anymore. Users of the join module can use oal.search.join.FixedBitSetCachingWrapperFilter instead.
  • Build:
  • Manifest includes non-parsed maven variables.
  • Add jar-src as top-level target to generate all Lucene and Solr *-src.jar.
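
The optimization above that prefix-codes the sorted/sortedset value dictionaries in DiskDV relies on a general technique: in a sorted list, each value tends to share a prefix with its predecessor, so only the shared length and the remaining suffix need to be stored. The sketch below shows that idea on strings; the class, the Entry type, and the in-memory layout are assumptions for illustration and do not describe DiskDV's actual on-disk format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of prefix coding (not DiskDV's real encoding).
final class PrefixCodingSketch {

    /** One encoded value: characters shared with the previous value, plus the rest. */
    static final class Entry {
        final int sharedPrefix;
        final String suffix;

        Entry(int sharedPrefix, String suffix) {
            this.sharedPrefix = sharedPrefix;
            this.suffix = suffix;
        }
    }

    /** Encodes values that are already sorted; the first value shares nothing. */
    static List<Entry> encode(List<String> sortedValues) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String value : sortedValues) {
            int max = Math.min(prev.length(), value.length());
            int shared = 0;
            while (shared < max && prev.charAt(shared) == value.charAt(shared)) {
                shared++;
            }
            out.add(new Entry(shared, value.substring(shared)));
            prev = value;
        }
        return out;
    }

    /** Rebuilds the original values by replaying each entry against its predecessor. */
    static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            String value = prev.substring(0, e.sharedPrefix) + e.suffix;
            out.add(value);
            prev = value;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("apple", "apply", "apricot", "banana");
        for (Entry e : encode(values)) {
            System.out.println(e.sharedPrefix + " + \"" + e.suffix + "\"");
        }
    }
}
```

The trade-off is that decoding a value requires its predecessor, so practical formats usually restart the prefix chain at regular intervals to keep random access cheap; whether and how DiskDV does this is not stated in this changelog.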

New in Apache Lucene 4.4.0 (Jul 24, 2013)

  • Changes in backwards compatibility policy:
  • LUCENE-5085: MorfologikFilter will no longer stem words marked as keywords
  • LUCENE-4955: NGramTokenFilter now emits all n-grams for the same token at the same position and preserves the position length and the offsets of the original token.
  • LUCENE-4955: NGramTokenizer now emits n-grams in a different order (a, ab, b, bc, c) instead of (a, b, c, ab, bc) and doesn't trim trailing whitespaces.
  • LUCENE-5042: The n-gram and edge n-gram tokenizers and filters now correctly handle supplementary characters, and the tokenizers have the ability to pre-tokenize the input stream similarly to CharTokenizer.
  • LUCENE-4967: NRTManager is replaced by ControlledRealTimeReopenThread, for controlling which requests must see which indexing changes, so that it can work with any ReferenceManager
  • LUCENE-4973: SnapshotDeletionPolicy no longer requires a unique String id
  • LUCENE-4946: The internal sorting API (SorterTemplate, now Sorter) has been completely refactored to allow for a better implementation of TimSort.
  • LUCENE-4963: Some TokenFilter options that generate broken TokenStreams have been deprecated: updateOffsets=true on TrimFilter and enablePositionIncrements=false on all classes that inherit from FilteringTokenFilter: JapanesePartOfSpeechStopFilter, KeepWordFilter, LengthFilter, StopFilter and TypeTokenFilter.
  • LUCENE-4963: In order not to take position increments into account in suggesters, you now need to call setPreservePositionIncrements(false) instead of configuring the token filters to not increment positions.
  • LUCENE-3907: EdgeNGramTokenizer now supports maxGramSize > 1024, doesn't trim the input, sets position increment = 1 for all tokens and doesn't support backward grams anymore.
  • LUCENE-3907: EdgeNGramTokenFilter does not support backward grams and does not update offsets anymore.
  • LUCENE-4981: PositionFilter is now deprecated as it can corrupt token stream graphs. Since its main use-case was to make query parsers generate boolean queries instead of phrase queries, it is now advised to use QueryParser.setAutoGeneratePhraseQueries(false) (for simple cases) or to override QueryParser.newFieldQuery.
  • LUCENE-5018: CompoundWordTokenFilterBase and its children DictionaryCompoundWordTokenFilter and HyphenationCompoundWordTokenFilter don't update offsets anymore.
  • LUCENE-5015: SamplingAccumulator no longer corrects the counts of the sampled categories. You should set TakmiSampleFixer on SamplingParams if required (but notice that this means slower search).
  • LUCENE-4933: Replace ExactSimScorer/SloppySimScorer with just SimScorer. Previously there were 2 implementations as a performance hack to support tableization of sqrt(), but this caching is removed, as sqrt is implemented in hardware with modern JVMs and it's faster not to cache.
  • LUCENE-5038: MergePolicy now has a default implementation for useCompoundFile based on segment size and noCFSRatio. The default implementation was pulled up from TieredMergePolicy.
  • LUCENE-5063: FieldCache.get(Bytes|Shorts), SortField.Type.(BYTE|SHORT) and FieldCache.DEFAULT_(BYTE|SHORT|INT|LONG|FLOAT|DOUBLE)_PARSER are now deprecated. These methods/types assume that data is stored as strings although Lucene has much better support for numeric data through (Int|Long)Field, NumericRangeQuery and FieldCache.get(Int|Long)s.
  • LUCENE-5078: TFIDFSimilarity lets you encode the norm value as any arbitrary long. As a result, encode/decodeNormValue were made abstract with their signatures changed. The default implementation was moved to DefaultSimilarity, which encodes the norm as a single-byte value.
  • Bug Fixes:
  • LUCENE-4890: QueryTreeBuilder.getBuilder() only finds interfaces on the most derived class.
  • LUCENE-4997: Internal test framework's tests are sensitive to previous test failures and tests.failfast.
  • LUCENE-4955: NGramTokenizer now supports inputs larger than 1024 chars.
  • LUCENE-4959: Fix incorrect return value in SimpleNaiveBayesClassifier.assignClass.
  • LUCENE-4972: DirectoryTaxonomyWriter created empty commits even if no changes were made.
  • LUCENE-949: AnalyzingQueryParser can't work with leading wildcards.
  • LUCENE-4980: Fix issues preventing mixing of RangeFacetRequest and non-RangeFacetRequest when using DrillSideways.
  • LUCENE-4996: Ensure DocInverterPerField always includes field name in exception messages.
  • LUCENE-4992: Fix constructor of CustomScoreQuery to take FunctionQuery for scoringQueries. Instead use QueryValueSource to safely wrap arbitrary queries and use them with CustomScoreQuery.
  • LUCENE-5016: SamplingAccumulator returned inconsistent label if asked to aggregate a non-existing category. Also fixed a bug in RangeAccumulator if some readers did not have the requested numeric DV field.
  • LUCENE-5028: Remove pointless and confusing doShare option in FST's PositiveIntOutputs
  • LUCENE-5032: Fix IndexOutOfBoundsExc in PostingsHighlighter when multi-valued fields exceed maxLength
  • LUCENE-4933: SweetSpotSimilarity didn't apply its tf function to some queries (SloppyPhraseQuery, SpanQueries).
  • LUCENE-5033: SlowFuzzyQuery was accepting too many terms (documents) when provided minSimilarity is an int > 1
  • LUCENE-5045: DrillSideways.search did not work on an empty index.
  • LUCENE-4995: CompressingStoredFieldsReader now only reuses an internal buffer when there is no more than 32kb to decompress. This prevents out-of-memory errors when working with large stored fields.
  • LUCENE-5062: If the spatial data for a document was comprised of multiple overlapping or adjacent parts then a CONTAINS predicate query might not match when the sum of those shapes contains the query shape but none do individually. A flag was added to use the original faster algorithm.
  • LUCENE-4971: Fixed NPE in AnalyzingSuggester when there are too many graph expansions.
  • LUCENE-5080: Combined setMaxMergeCount and setMaxThreadCount into one setter in ConcurrentMergeScheduler: setMaxMergesAndThreads. Previously these setters would not work unless you invoked them very carefully.
  • LUCENE-5068: QueryParserUtil.escape() does not escape forward slash.
  • LUCENE-5103: A join on a single-valued field with deleted docs scored too few docs.
  • LUCENE-5090: Detect mismatched readers passed to SortedSetDocValuesReaderState and SortedSetDocValuesAccumulator.
  • LUCENE-5120: AnalyzingSuggester modified its FST's cached root arc if payloads are used and the entire output resided on the root arc on the first access. This caused subsequent suggest calls to fail.
  • Optimizations:
  • LUCENE-4936: Improve numeric doc values compression in case all values share a common divisor. In particular, this improves the compression ratio of dates without time when they are encoded as milliseconds since Epoch. Also support TABLE compressed numerics in the Disk codec.
  • LUCENE-4951: DrillSideways uses the new Scorer.cost() method to make better decisions about which scorer to use internally.
  • LUCENE-4976: PersistentSnapshotDeletionPolicy writes its state to a single snapshots_N file, and no longer requires closing
  • LUCENE-5035: Compress addresses in FieldCacheImpl.SortedDocValuesImpl more efficiently.
  • LUCENE-4941: Sort "from" terms only once when using JoinUtil.
  • LUCENE-5050: Close the stored fields and term vectors index files as soon as the index has been loaded into memory to save file descriptors.
  • LUCENE-5086: RamUsageEstimator now uses official Java 7 API or a proprietary Oracle Java 6 API to get Hotspot MX bean, preventing AWT classes from being loaded on MacOSX.
  • New Features:
  • LUCENE-5085: MorfologikFilter will no longer stem words marked as keywords
  • LUCENE-5064: Added PagedMutable (internal), a paged extension of PackedInts.Mutable which allows for storing more than 2B values.
  • LUCENE-4766: Added a PatternCaptureGroupTokenFilter that uses Java regexes to emit multiple tokens, one for each capture group in one or more patterns.
  • LUCENE-4952: Expose control (protected method) in DrillSideways to force all sub-scorers to be on the same document being collected. This is necessary when using collectors like ToParentBlockJoinCollector with DrillSideways.
  • SOLR-4761: Add SimpleMergedSegmentWarmer, which just initializes terms, norms, docvalues, and so on.
  • LUCENE-4964: Allow arbitrary Query for per-dimension drill-down to DrillDownQuery and DrillSideways, to support future dynamic faceting methods
  • LUCENE-4966: Add CachingWrapperFilter.sizeInBytes()
  • LUCENE-4965: Add dynamic (no taxonomy index used) numeric range faceting to Lucene's facet module
  • LUCENE-4979: LiveFieldValues can work with any ReferenceManager, not just ReferenceManager<IndexSearcher>
  • LUCENE-4975: Added a new Replicator module which can replicate index revisions between server and client.
  • LUCENE-5022: Added FacetResult.mergeHierarchies to merge multiple FacetResult of the same dimension into a single one with the reconstructed hierarchy.
  • LUCENE-5026: Added PagedGrowableWriter, a new internal packed-ints structure that grows the number of bits per value on demand, can store more than 2B values and supports random write and read access.
  • LUCENE-5025: FST's Builder can now handle more than 2.1 billion "tail nodes" while building a minimal FST.
  • LUCENE-5063: FieldCache.DEFAULT.get(Ints|Longs) now uses bit-packing to save memory.
  • LUCENE-5079: IndexWriter.hasUncommittedChanges() returns true if there are changes that have not been committed.
  • SOLR-4565: Extend NorwegianLightStemFilter and NorwegianMinimalStemFilter to handle "nynorsk"
  • LUCENE-5087: Add getMultiValuedSeparator to PostingsHighlighter, for cases where you want a different logical separator between field values. This can be set to e.g. U+2029 PARAGRAPH SEPARATOR if you never want passages to span values.
  • LUCENE-5013: Added ScandinavianFoldingFilterFactory and ScandinavianNormalizationFilterFactory
  • LUCENE-4845: AnalyzingInfixSuggester finds suggestions based on matches to any tokens in the suggestion, not just based on pure prefix matching.
  • API Changes:
  • LUCENE-5077: Make it easier to use compressed norms. Lucene42NormsFormat takes an overhead parameter, so you can easily pass a different value other than PackedInts.FASTEST from your own codec.
  • LUCENE-5097: Analyzer now has an additional tokenStream(String fieldName, String text) method, so wrapping by StringReader for common use is no longer needed. This method uses an internal reusable reader, which was previously only used by the Field class.
  • LUCENE-4542: HunspellStemFilter's maximum recursion level is now configurable.
  • Build:
  • LUCENE-4987: Upgrade randomized testing to version 2.0.10: Test framework may fail internally due to overly aggressive J9 optimizations.
  • LUCENE-5043: The eclipse target now uses the containing directory for the project name. This also enforces UTF-8 encoding when files are copied with filtering.
  • LUCENE-5055: "rat-sources" target now checks also build.xml, ivy.xml, forbidden-api signatures, and parts of resources folders.
  • LUCENE-5072: Automatically patch javadocs generated by JDK versions before 7u25 to work around the frame injection vulnerability (CVE-2013-1571, VU#225657).
  • Tests:
  • LUCENE-4901: TestIndexWriterOnJRECrash should work on any JRE vendor via Runtime.halt().
  • Changes in runtime behavior:
  • LUCENE-5038: New segments written by IndexWriter are now wrapped into CFS by default. DocumentsWriterPerThread doesn't consult MergePolicy anymore to decide if a CFS must be written, instead IndexWriterConfig now has a property to enable / disable CFS for newly created segments.
  • LUCENE-5107: Properties files by Lucene are now written in UTF-8 encoding, Unicode is no longer escaped. Reading of legacy properties files with \u escapes is still possible.
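
LUCENE-4936 above mentions better compression of numeric doc values when all values share a common divisor, for example dates without time stored as milliseconds since Epoch. The following is a minimal sketch of that idea in plain Java, not Lucene's codec code: the class and method names are hypothetical, values are assumed small enough that subtraction does not overflow, and the real on-disk format is not shown.

```java
import java.util.Arrays;

// Hypothetical illustration; not part of Lucene.
public class GcdCompressionSketch {

    // Greatest common divisor of two non-negative longs (Euclid's algorithm).
    static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    // Number of bits needed to store a non-negative value (at least 1).
    static int bitsRequired(long v) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(v));
    }

    // Returns {min, divisor} so that every value equals min + divisor * q, q >= 0.
    static long[] minAndGcd(long[] values) {
        long min = Arrays.stream(values).min().getAsLong();
        long g = 0;
        for (long v : values) {
            g = gcd(g, v - min);
        }
        // If all values are equal the gcd is 0; use 1 to avoid dividing by zero.
        return new long[] {min, g == 0 ? 1 : g};
    }

    // Store only the small quotients instead of the raw values.
    static long[] encode(long[] values, long min, long divisor) {
        long[] quotients = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            quotients[i] = (values[i] - min) / divisor;
        }
        return quotients;
    }

    static long decode(long quotient, long min, long divisor) {
        return min + divisor * quotient;
    }

    public static void main(String[] args) {
        long day = 86_400_000L; // milliseconds per day
        long[] dates = {15_000 * day, 15_001 * day, 15_030 * day, 15_365 * day};
        long[] mg = minAndGcd(dates);
        long[] q = encode(dates, mg[0], mg[1]);
        System.out.println("divisor = " + mg[1]);
        System.out.println("quotients = " + Arrays.toString(q));
        System.out.println("bits per raw delta = " + bitsRequired(dates[3] - mg[0]));
        System.out.println("bits per quotient = " + bitsRequired(q[3]));
    }
}
```

Because every date is a whole number of days, the divisor is one day in milliseconds and each stored quotient is just a day offset, which needs far fewer bits than the raw millisecond delta.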

New in Apache Lucene 4.3.1 (Jun 19, 2013)

  • Bug Fixes:
  • SOLR-4813: Fix SynonymFilterFactory to allow init parameters for tokenizer factory used when parsing synonyms file.
  • LUCENE-4935: CustomScoreQuery wrongly applied its query boost twice (boost^2).
  • LUCENE-4948: Fixed ArrayIndexOutOfBoundsException in PostingsHighlighter if you had a 64-bit JVM without compressed OOPS: IBM J9, or Oracle with large heap/explicitly disabled.
  • LUCENE-4953: Fixed ParallelCompositeReader to inform ReaderClosedListeners of its synthetic subreaders. FieldCaches keyed on the atomic children will be purged earlier and FC insanity prevented. In addition, ParallelCompositeReader's toString() was changed to better reflect the reader structure.
  • LUCENE-4968: Fixed ToParentBlockJoinQuery/Collector: correctly handle parent hits that had no child matches, don't throw IllegalArgumentEx when the child query has no hits, more aggressively catch cases where childQuery incorrectly matches parent documents
  • LUCENE-4970: Fix boost value of rewritten NGramPhraseQuery.
  • LUCENE-4974: CommitIndexTask was broken if no params were set.
  • LUCENE-4986: Fixed case where a newly opened near-real-time reader fails to reflect a delete from IndexWriter.tryDeleteDocument
  • LUCENE-4991: Fix handling of synonyms in classic QueryParser.getFieldQuery for terms not separated by whitespace. PositionIncrementAttribute was ignored, so with default AND synonyms wrongly became mandatory clauses, and with OR, the coordination factor was wrong.
  • LUCENE-4994: Fix PatternKeywordMarkerFilter to have public constructor.
  • LUCENE-4993: Fix BeiderMorseFilter to preserve custom attributes when inserting tokens with position increment 0.
  • LUCENE-5002: IndexWriter#deleteAll() caused a deadlock in DWPT / DWSC if a DWPT was flushing concurrently while deleteAll() aborted all DWPT. The IW should never wait on DWPT via the flush control while holding on to the IW Lock.
  • Optimizations:
  • LUCENE-4938: Don't use an unnecessarily large priority queue in IndexSearcher methods that take top-N.
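
As an illustration of the LUCENE-4938 entry above, one simple way to avoid an oversized priority queue for a top-N request is to clamp the queue size to the number of candidates. The sketch below is plain Java with hypothetical names; whether the actual fix in IndexSearcher works exactly this way is an assumption, and the requested N is assumed to be at least 1.

```java
import java.util.PriorityQueue;

// Hypothetical illustration; not Lucene's IndexSearcher code.
public class TopNSketch {

    // Never size the queue beyond the number of documents that could be collected.
    static int queueSize(int requestedTopN, int maxDoc) {
        return Math.min(requestedTopN, Math.max(1, maxDoc));
    }

    // Returns the best scores in descending order using a bounded min-heap.
    static float[] topScores(float[] scores, int requestedTopN) {
        int size = queueSize(requestedTopN, scores.length);
        PriorityQueue<Float> heap = new PriorityQueue<>(size);
        for (float s : scores) {
            if (heap.size() < size) {
                heap.add(s);
            } else if (s > heap.peek()) {
                heap.poll(); // evict the current weakest hit
                heap.add(s);
            }
        }
        float[] out = new float[heap.size()];
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = heap.poll(); // smallest first, so fill from the end
        }
        return out;
    }

    public static void main(String[] args) {
        float[] scores = {0.5f, 2.0f, 1.0f};
        // A caller asking for "all" hits does not force a huge allocation.
        System.out.println(queueSize(Integer.MAX_VALUE, scores.length));
        System.out.println(java.util.Arrays.toString(topScores(scores, 2)));
    }
}
```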

New in Apache Lucene 4.3.0 (May 3, 2013)

  • Changes in backwards compatibility policy:
  • LUCENE-4810: EdgeNGramTokenFilter no longer increments position for multiple ngrams derived from the same input token. (Walter Underwood via Mike McCandless)
  • LUCENE-4822: KeywordTokenFilter is now an abstract class. Subclasses need to implement #isKeyword() in order to mark terms as keywords. The existing functionality has been factored out into a new SetKeywordTokenFilter class. (Simon Willnauer, Uwe Schindler)
  • LUCENE-4642: Remove Tokenizer's and subclasses' ctors taking AttributeSource. (Renaud Delbru, Uwe Schindler, Steve Rowe)
  • LUCENE-4833: IndexWriterConfig used to use LogByteSizeMergePolicy when calling setMergePolicy(null) although the default merge policy is TieredMergePolicy. IndexWriterConfig setters now throw an exception when passed null if null is not a valid value. (Adrien Grand)
  • LUCENE-4849: Made ParallelTaxonomyArrays abstract with a concrete implementation for DirectoryTaxonomyWriter/Reader. Also moved it under o.a.l.facet.taxonomy. (Shai Erera)
  • LUCENE-4876: IndexDeletionPolicy is now an abstract class instead of an interface. IndexDeletionPolicy, MergeScheduler and InfoStream now implement Cloneable. (Adrien Grand)
  • LUCENE-4874: FilterAtomicReader and related classes (FilterTerms, FilterDocsEnum, ...) don't forward anymore to the filtered instance when the method has a default implementation through other abstract methods. (Adrien Grand, Robert Muir)
  • LUCENE-4642, LUCENE-4877: Implementors of TokenizerFactory, TokenFilterFactory, and CharFilterFactory now need to provide at least one constructor taking Map to be able to be loaded by the SPI framework (e.g., from Solr). In addition, TokenizerFactory needs to implement the abstract create(AttributeFactory,Reader) method. (Renaud Delbru, Uwe Schindler, Steve Rowe, Robert Muir)
  • API Changes:
  • LUCENE-4896: Made PassageFormatter abstract in PostingsHighlighter, made members of DefaultPassageFormatter protected. (Luca Cavanna via Robert Muir)
  • LUCENE-4844: removed TaxonomyReader.getParent(), you should use TaxonomyReader.getParallelArrays().parents() instead. (Shai Erera)
  • LUCENE-4742: Renamed spatial 'Node' to 'Cell', along with any method names and variables using this terminology. (David Smiley)
  • New Features:
  • LUCENE-4815: DrillSideways now allows more than one FacetRequest per dimension (Mike McCandless)
  • LUCENE-3918: IndexSorter has been ported to 4.3 API and now supports sorting documents by a numeric DocValues field, or reversing the order of the documents in the index. Additionally, apps can implement their own sort criteria. (Anat Hashavit, Shai Erera)
  • LUCENE-4817: Added KeywordRepeatFilter, which emits each token twice, once as a keyword and once as an ordinary token, allowing stemmers to emit a stemmed version along with the un-stemmed version. (Simon Willnauer)
  • LUCENE-4822: PatternKeywordTokenFilter can mark tokens as keywords based on regular expressions. (Simon Willnauer, Uwe Schindler)
  • LUCENE-4821: AnalyzingSuggester now uses the ending offset to determine whether the last token was finished or not, so that a query "i " will no longer suggest "Isla de Muerta" for example. (Mike McCandless)
  • LUCENE-4642: Add create(AttributeFactory) to TokenizerFactory and subclasses with ctors taking AttributeFactory. (Renaud Delbru, Uwe Schindler, Steve Rowe)
  • LUCENE-4820: Add payloads to Analyzing/FuzzySuggester, to record an arbitrary byte[] per suggestion (Mike McCandless)
  • LUCENE-4816: Add WholeBreakIterator to PostingsHighlighter for treating the entire content as a single Passage. (Robert Muir, Mike McCandless)
  • LUCENE-4827: Add additional ctor to PostingsHighlighter PassageScorer to provide bm25 k1,b,avgdl parameters. (Robert Muir)
  • LUCENE-4607: Add DocIDSetIterator.cost() and Spans.cost() for optimizing scoring. (Simon Willnauer, Robert Muir)
  • LUCENE-4795: Add SortedSetDocValuesFacetFields and SortedSetDocValuesAccumulator, to compute topK facet counts from a field's SortedSetDocValues. This method only supports flat (dim/label) facets, is a bit (~25%) slower, has added cost per-IndexReader-open to compute its ordinal map, but it requires no taxonomy index and it tie-breaks facet labels in an understandable (by Unicode sort order) way. (Robert Muir, Mike McCandless)
  • LUCENE-4843: Add LimitTokenPositionFilter: don't emit tokens with positions that exceed the configured limit. (Steve Rowe)
  • LUCENE-4832: Add ToParentBlockJoinCollector.getTopGroupsWithAllChildDocs, to retrieve all children in each group. (Aleksey Aleev via Mike McCandless)
  • LUCENE-4846: PostingsHighlighter subclasses can override where the String values come from (it still defaults to pulling from stored fields). (Robert Muir, Mike McCandless)
  • LUCENE-4853: Add PostingsHighlighter.highlightFields method that takes int[] docIDs instead of TopDocs. (Robert Muir, Mike McCandless)
  • LUCENE-4856: If there are no matches for a given field, return the first maxPassages sentences (Robert Muir, Mike McCandless)
  • LUCENE-4859: IndexReader now exposes Terms statistics: getDocCount, getSumDocFreq, getSumTotalTermFreq. (Shai Erera)
  • LUCENE-4862: It is now possible to terminate collection of a single IndexReader leaf by throwing a CollectionTerminatedException in Collector.collect. (Adrien Grand, Shai Erera)
  • LUCENE-4752: New SortingMergePolicy (in lucene/misc) that sorts documents before merging segments. (Adrien Grand, Shai Erera, David Smiley)
  • LUCENE-4860: Customize scoring and formatting per-field in PostingsHighlighter by subclassing and overriding the getFormatter and/or getScorer methods. This also changes Passage.getMatchTerms() to return BytesRef[] instead of Term[]. (Robert Muir, Mike McCandless)
  • LUCENE-4839: Added SorterTemplate.timSort, an O(n log n) stable sort algorithm that performs well on partially sorted data. (Adrien Grand)
  • LUCENE-4644: Added support for the "IsWithin" spatial predicate for RecursivePrefixTreeStrategy. It's for matching non-point indexed shapes; if you only have points (1/doc) then "Intersects" is equivalent and faster. See the javadocs. (David Smiley)
  • LUCENE-4861: Make BreakIterator per-field in PostingsHighlighter. This means you can override getBreakIterator(String field) to use different mechanisms for e.g. title vs. body fields. (Mike McCandless, Robert Muir)
  • LUCENE-4645: Added support for the "Contains" spatial predicate for RecursivePrefixTreeStrategy. (David Smiley)
  • LUCENE-4898: DirectoryReader.openIfChanged now allows opening a reader on an IndexCommit starting from a near-real-time reader (previously this would throw IllegalArgumentException). (Mike McCandless)
  • LUCENE-4905: Made the maxPassages parameter per-field in PostingsHighlighter. (Robert Muir)
  • LUCENE-4897: Added TaxonomyReader.getChildren for traversing a category's children. (Shai Erera)
  • LUCENE-4902: Added FilterDirectoryReader to allow easy filtering of a DirectoryReader's subreaders. (Alan Woodward, Adrien Grand, Uwe Schindler)
  • LUCENE-4858: Added EarlyTerminatingSortingCollector to be used in conjunction with SortingMergePolicy, which allows early termination of queries on sorted indexes when the sort order matches the index order. (Adrien Grand, Shai Erera)
  • LUCENE-4904: Added descending sort order to NumericDocValuesSorter. (Shai Erera)
  • LUCENE-3786: Added SearcherTaxonomyManager, to manage access to both IndexSearcher and DirectoryTaxonomyReader for near-real-time faceting. (Shai Erera, Mike McCandless)
  • LUCENE-4915: DrillSideways now allows drilling down on fields that are not faceted. (Mike McCandless)
  • LUCENE-4895: Added support for the "IsDisjointTo" spatial predicate for RecursivePrefixTreeStrategy. (David Smiley)
  • LUCENE-4774: Added FieldComparator that allows sorting parent documents based on fields on the child / nested document level. (Martijn van Groningen)
  • Optimizations:
  • LUCENE-4839: SorterTemplate.merge can now be overridden in order to replace the default implementation which merges in-place by a faster implementation that could require fewer swaps at the expense of some extra memory. ArrayUtil and CollectionUtil override it so that their mergeSort and timSort methods are faster but only require up to 1% of extra memory. (Adrien Grand)
  • LUCENE-4571: Speed up BooleanQuerys with minNrShouldMatch to use skipping. (Stefan Pohl via Robert Muir)
  • LUCENE-4863: StemmerOverrideFilter now uses a FST to represent its overrides in memory. (Simon Willnauer)
  • LUCENE-4889: UnicodeUtil.codePointCount implementation replaced with a non-array-lookup version. (Dawid Weiss)
  • LUCENE-4923: Speed up BooleanQuerys processing of in-order disjunctions. (Robert Muir)
  • LUCENE-4926: Speed up DisjunctionMaxQuery. (Robert Muir)
  • LUCENE-4930: Reduce contention in older/buggy JVMs when using AttributeSource#addAttribute() because java.lang.ref.ReferenceQueue#poll() is implemented using synchronization. (Christian Ziech, Karl Wright, Uwe Schindler)
  • Bug Fixes:
  • LUCENE-4868: SumScoreFacetsAggregator used an incorrect index into the scores array. (Shai Erera)
  • LUCENE-4882: FacetsAccumulator did not allow counting the ROOT category (i.e. counting dimensions). (Shai Erera)
  • LUCENE-4876: IndexWriterConfig.clone() now clones its MergeScheduler, IndexDeletionPolicy and InfoStream in order to make an IndexWriterConfig and its clone fully independent. (Adrien Grand)
  • LUCENE-4893: Facet counts were multiplied by the number of times FacetsCollector.getFacetResults() was called. (Shai Erera)
  • LUCENE-4888: Fixed SloppyPhraseScorer, MultiDocs(AndPositions)Enum and MultiSpansWrapper which happened to sometimes call DocIdSetIterator.advance with target
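
The LUCENE-4889 entry above refers to counting code points in UTF-8 data without an array lookup table. The changelog does not describe the replacement, so the following is only a sketch of one table-free approach in plain Java: it counts every byte that is not a UTF-8 continuation byte. The class name is hypothetical, the input is assumed to be valid UTF-8, and this is not claimed to be UnicodeUtil's implementation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not Lucene's UnicodeUtil.
public class Utf8CodePointCountSketch {

    // Counts code points in utf8[offset, offset + length), assuming valid UTF-8.
    static int codePointCount(byte[] utf8, int offset, int length) {
        int count = 0;
        for (int i = offset; i < offset + length; i++) {
            // Continuation bytes look like 10xxxxxx; every other byte starts a code point.
            if ((utf8[i] & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length + " bytes, "
            + codePointCount(bytes, 0, bytes.length) + " code points");
    }
}
```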

New in Apache Lucene 4.2.1 (Apr 4, 2013)

  • Bug Fixes:
  • LUCENE-4713: The SPI components used to load custom codecs or analysis components were fixed to also scan the Lucene ClassLoader in addition to the context ClassLoader, so Lucene is always able to find its own codecs. The special case of a null context ClassLoader is now also supported.
  • LUCENE-4819: seekExact(BytesRef, boolean) did not work correctly with Sorted[Set]DocValuesTermsEnum.
  • LUCENE-4826: PostingsHighlighter was not returning the top N best scoring passages.
  • LUCENE-4854: Fix DocTermOrds.getOrdTermsEnum() to not return negative ord on initial next().
  • LUCENE-4836: Fix SimpleRateLimiter#pause to return the actual time spent sleeping instead of the wakeup timestamp in nanoseconds.
  • LUCENE-4828: BooleanQuery no longer extracts terms from its MUST_NOT clauses.
  • SOLR-4589: Fixed CPU spikes and poor performance in lazy field loading of multivalued fields.
  • LUCENE-4870: Fix bug where an entire index might be deleted by the IndexWriter due to false detection if an index exists in the directory when OpenMode.CREATE_OR_APPEND is used. This might also affect applications that set the open mode manually using DirectoryReader#indexExists.
  • LUCENE-4878: Override getRegexpQuery in MultiFieldQueryParser to prevent NullPointerException when regular expression syntax is used with MultiFieldQueryParser.
  • Optimizations:
  • LUCENE-4819: Added Sorted[Set]DocValues.termsEnum(), and optimized the default codec for improved enumeration performance.
  • LUCENE-4854: Speed up TermsEnum of FieldCache.getDocTermOrds.
  • LUCENE-4857: Don't unnecessarily copy stem override map in StemmerOverrideFilter.

New in Apache Lucene 4.2.0 (Mar 12, 2013)

  • Changes in backwards compatibility policy:
  • LUCENE-4602: FacetFields now stores facet ordinals in a DocValues field, rather than a payload. This requires rebuilding existing indexes, or doing a one-time migration using FacetsPayloadMigratingReader. Since DocValues support in-memory caching, CategoryListCache was removed too. (Shai Erera, Michael McCandless)
  • LUCENE-4697: FacetResultNode is now a concrete class with public members (instead of getter methods). (Shai Erera)
  • LUCENE-4600: FacetsCollector is now an abstract class with two implementations: StandardFacetsCollector (the old version of FacetsCollector) and CountingFacetsCollector. FacetsCollector.create() returns the most optimized collector for the given parameters. (Shai Erera, Michael McCandless)
  • LUCENE-4700: OrdinalPolicy is now per CategoryListParams, and is no longer an interface, but rather an enum with values NO_PARENTS and ALL_PARENTS. PathPolicy was removed, you should extend FacetFields and DrillDownStream to control which categories are added as drill-down terms. (Shai Erera)
  • LUCENE-4547: DocValues improvements:
  • Simplified codec API: codecs are now only responsible for encoding and decoding docvalues, they do not need to do buffering or RAM accounting.
  • Per-Field support: added PerFieldDocValuesFormat, which allows you to use a different DocValuesFormat per field (like postings).
  • Unified with FieldCache api: DocValues can be accessed via FieldCache API, so it works automatically with grouping/join/sort/function queries, etc.
  • Simplified types: There are only 3 types (NUMERIC, BINARY, SORTED), so it's not necessary to specify for example that all of your binary values have the same length. Instead it's easy for the Codec API to optimize encoding based on any properties of the content. (Simon Willnauer, Adrien Grand, Mike McCandless, Robert Muir)
  • LUCENE-4757: Cleanup and refactoring of FacetsAccumulator, FacetRequest, FacetsAggregator and FacetResultsHandler API. If your application did FacetsCollector.create(), you should not be affected, but if you wrote an Aggregator, then you should migrate it to the per-segment FacetsAggregator. You can still use StandardFacetsAccumulator, which works with the old API (for now). (Shai Erera)
  • LUCENE-4761: Facet packages reorganized. Should be easy to fix your import statements, if you use an IDE such as Eclipse. (Shai Erera)
  • LUCENE-4750: Convert DrillDown to DrillDownQuery, so you can initialize it and add drill-down categories to it. (Michael McCandless, Shai Erera)
  • LUCENE-4759: remove FacetRequest.SortBy; result categories are always sorted by value, while ties are broken by category ordinal. (Shai Erera)
  • LUCENE-4772: Facet associations moved to new FacetsAggregator API. You should override FacetsAccumulator and return the relevant aggregator, for aggregating the association values. (Shai Erera)
  • LUCENE-4748: A FacetRequest on a non-existent field now returns an empty FacetResult instead of skipping it. (Shai Erera, Mike McCandless)
  • LUCENE-4806: The default category delimiter character was changed from U+F749 to U+001F, since the latter uses 1 byte vs 3 bytes for the former. Existing facet indices must be reindexed. (Robert Muir, Shai Erera, Mike McCandless)
  • Optimizations:
  • LUCENE-4687: BloomFilterPostingsFormat now lazily initializes delegate TermsEnum only if needed to do a seek or get a DocsEnum. (Simon Willnauer)
  • LUCENE-4677, LUCENE-4682: unpacked FSTs now use vInt to encode the node target, to reduce their size (Mike McCandless)
  • LUCENE-4678: FST now uses a paged byte[] structure instead of a single byte[] internally, to avoid large memory spikes during building (James Dyer, Mike McCandless)
  • LUCENE-3298: FST can now be larger than 2.1 GB / 2.1 B nodes. (James Dyer, Mike McCandless)
  • LUCENE-4690: Performance improvements and non-hashing versions of NumericUtils.*ToPrefixCoded() (yonik)
  • LUCENE-4715: CategoryListParams.getOrdinalPolicy can now return a different OrdinalPolicy per dimension, to better tune how you index facets. Also added OrdinalPolicy.ALL_BUT_DIMENSION. (Shai Erera, Michael McCandless)
  • LUCENE-4740: Don't track clones of MMapIndexInput if unmapping is disabled. This reduces GC overhead. (Kristofer Karlsson, Uwe Schindler)
  • LUCENE-4733: The default Lucene 4.2 codec now uses a more compact TermVectorsFormat (Lucene42TermVectorsFormat) based on CompressingTermVectorsFormat. (Adrien Grand)
  • LUCENE-3729: The default Lucene 4.2 codec now uses a more compact DocValuesFormat (Lucene42DocValuesFormat). Sorted values are stored in an FST, Numerics and Ordinals use a number of strategies (delta-compression, table-compression, etc), and memory addresses use MonotonicBlockPackedWriter. (Simon Willnauer, Adrien Grand, Mike McCandless, Robert Muir)
  • LUCENE-4792: Reduction of the memory required to build the doc ID maps used when merging segments. (Adrien Grand)
  • LUCENE-4794: Spatial RecursivePrefixTreeStrategy's search filter: Skip calls to termsEnum.seek() when the next term is known to follow the current cell. (David Smiley)
  • New Features:
  • LUCENE-4686: New specialized DGapVInt8IntEncoder for facets (now the default). (Shai Erera)
  • LUCENE-4703: Add simple PrintTaxonomyStats tool to see summary information about the facets taxonomy index. (Mike McCandless)
  • LUCENE-4599: New oal.codecs.compressing.CompressingTermVectorsFormat which compresses term vectors into chunks of documents similarly to CompressingStoredFieldsFormat. (Adrien Grand)
  • LUCENE-4695: Added LiveFieldValues utility class, for getting the current (live, real-time) value for any indexed doc/field. The class buffers recently indexed doc/field values until a new near-real-time reader is opened that contains those changes. (Robert Muir, Mike McCandless)
  • LUCENE-4723: Add AnalyzerFactoryTask to benchmark, and enable analyzer creation via the resulting factories using NewAnalyzerTask. (Steve Rowe)
  • LUCENE-4728: Unknown and not explicitly mapped queries are now rewritten against the highlighting IndexReader to obtain primitive queries before discarding the query entirely. WeightedSpanTermExtractor now builds a MemoryIndex only once even if multiple fields are highlighted. (Simon Willnauer)
  • LUCENE-4035: Added ICUCollationDocValuesField, more efficient support for Locale-sensitive sort and range queries for single-valued fields. (Robert Muir)
  • LUCENE-4547: Added MonotonicBlockPacked(Reader/Writer), which provide efficient random access to large amounts of monotonically increasing positive values (e.g. file offsets). Each block stores the minimum value and the average gap, and values are encoded as signed deviations from the expected value. (Adrien Grand)
  • LUCENE-4547: Added AppendingLongBuffer, an append-only buffer that packs signed long values in memory and provides an efficient iterator API. (Adrien Grand)
  • LUCENE-4540: It is now possible for a codec to represent norms with less than 8 bits per value. For performance reasons this is not done by default, but you can customize your codec (e.g. pass PackedInts.DEFAULT to Lucene42DocValuesConsumer) if you want to make this tradeoff. (Adrien Grand, Robert Muir)
  • LUCENE-4764: A new Facet42Codec and Facet42DocValuesFormat provide faster but more RAM-consuming facet performance. (Shai Erera, Mike McCandless)
  • LUCENE-4769: Added OrdinalsCache and CachedOrdsCountingFacetsAggregator which uses the cache to obtain a document's ordinals. This aggregator is faster than others, however consumes much more RAM. (Michael McCandless, Shai Erera)
  • LUCENE-4778: Add a getter for the delegate in RateLimitedDirectoryWrapper. (Mark Miller)
  • LUCENE-4765: Add a multi-valued docvalues type (SORTED_SET). This is equivalent to building a FieldCache.getDocTermOrds at index-time. (Robert Muir)
  • LUCENE-4780: Add MonotonicAppendingLongBuffer: an append-only buffer for monotonically increasing values. (Adrien Grand)
  • LUCENE-4748: Added DrillSideways utility class for computing both drill-down and drill-sideways counts for a DrillDownQuery. (Mike McCandless)
  • API Changes:
  • LUCENE-4709: FacetResultNode no longer has a residue field. (Shai Erera)
  • LUCENE-4716: DrillDown.query now takes Occur, allowing you to specify whether categories should be OR'ed or AND'ed. (Shai Erera)
  • LUCENE-4695: ReferenceManager.RefreshListener.afterRefresh now takes a boolean indicating whether a new reference was in fact opened, and a new beforeRefresh method notifies you when a refresh attempt is starting. (Robert Muir, Mike McCandless)
  • LUCENE-4794: Spatial RecursivePrefixTreeFilter replaced by IntersectsPrefixTreeFilter and some extensible base classes. (David Smiley)
  • Bug Fixes:
  • LUCENE-4705: Pass on FilterStrategy in FilteredQuery if the filtered query is rewritten. (Simon Willnauer)
  • LUCENE-4712: MemoryIndex#normValues() throws NPE if field doesn't exist. (Simon Willnauer, Ricky Pritchett)
  • LUCENE-4550: Shapes wider than 180 degrees would use too much accuracy for the PrefixTree based SpatialStrategy. For a pathological case of nearly 360 degrees and barely any height, it would generate so many indexed terms (> 500k) that it could even cause an OutOfMemoryError. Fixed. (David Smiley)
  • LUCENE-4704: Make join queries override hashcode and equals methods. (Martijn van Groningen)
  • LUCENE-4724: Fix bug in CategoryPath which allowed passing null or empty string components. This is forbidden now (throws an exception). Note that if you have a taxonomy index created with such strings, you should rebuild it. (Michael McCandless, Shai Erera)
  • LUCENE-4732: Fixed TermsEnum.seekCeil/seekExact on term vectors. (Adrien Grand, Robert Muir)
  • LUCENE-4739: Fixed bugs that prevented FSTs more than ~1.1GB from being saved and loaded (Adrien Grand, Mike McCandless)
  • LUCENE-4717: Fixed bug where Lucene40DocValuesFormat would sometimes write an extra unused ordinal for sorted types. The bug is detected and corrected on-the-fly for old indexes. (Robert Muir)
  • LUCENE-4547: Fixed bug where Lucene40DocValuesFormat was unable to encode segments that would exceed 2GB total data. This could happen in some surprising cases, for example if you had an index with more than 260M documents and a VAR_INT field. (Simon Willnauer, Adrien Grand, Mike McCandless, Robert Muir)
  • LUCENE-4775: Remove SegmentInfo.sizeInBytes() and make MergePolicy.OneMerge.totalBytesSize thread safe (Josh Bronson via Robert Muir, Mike McCandless)
  • LUCENE-4770: If spatial's TermQueryPrefixTreeStrategy was used to search indexed non-point shapes, then there was an edge case where a query should find a shape but it didn't. The fix is the removal of an optimization that simplifies some leaf cells into a parent. The index data for such a field is now ~20% larger. This optimization is still done for the query shape, and for indexed data for RecursivePrefixTreeStrategy. Furthermore, this optimization is enhanced to roll up beyond the bottom cell level. (David Smiley, Florian Schilling)
  • LUCENE-4790: Fix FieldCacheImpl.getDocTermOrds to not bake deletes into the cached datastructure. Otherwise this can cause inconsistencies with readers at different points in time. (Robert Muir)
  • LUCENE-4791: A conjunction of terms (ConjunctionTermScorer) scanned on the lowest frequency term instead of skipping, leading to potentially large performance impacts for many non-random or non-uniform term distributions. (John Wang, yonik)
  • LUCENE-4798: PostingsHighlighter's formatter sometimes didn't highlight matched terms. (Robert Muir)
  • LUCENE-4796, SOLR-4373: Fix concurrency issue in NamedSPILoader and AnalysisSPILoader when doing reload (e.g. from Solr). (Uwe Schindler, Hossman)
  • LUCENE-4802: Don't compute norms for drill-down facet fields. (Mike McCandless)
  • LUCENE-4804: PostingsHighlighter sometimes applied terms to the wrong passage, if they started exactly on a passage boundary. (Robert Muir)
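To illustrate the difference between scanning and skipping mentioned in LUCENE-4791, here is a minimal sketch of a two-list conjunction over sorted doc ID arrays. It is not Lucene's ConjunctionTermScorer: the class name ConjunctionSketch, the advance helper, and the use of plain int[] arrays with binary search are assumptions made for illustration, and each array is assumed to be strictly increasing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not Lucene code.
final class ConjunctionSketch {

    /** Index of the first element >= target in sorted[from..], or sorted.length if none. */
    static int advance(int[] sorted, int from, int target) {
        int idx = Arrays.binarySearch(sorted, from, sorted.length, target);
        return idx >= 0 ? idx : -idx - 1;
    }

    /** Intersects two strictly increasing doc ID lists, skipping instead of scanning. */
    static List<Integer> intersect(int[] lead, int[] other) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < lead.length && j < other.length) {
            if (lead[i] == other[j]) {
                hits.add(lead[i]);
                i++;
                j++;
            } else if (lead[i] < other[j]) {
                // Skip the lead list directly to the other list's current doc.
                i = advance(lead, i, other[j]);
            } else {
                j = advance(other, j, lead[i]);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] rare = {2, 5, 9, 40};
        int[] common = {1, 2, 3, 9, 10, 40, 41};
        System.out.println(intersect(rare, common));
    }
}
```

When one list overshoots, the other jumps straight to the first doc ID at or beyond that position rather than stepping through its entries one at a time, which is the behavior the fix restores for skewed term distributions.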
  • Documentation:
  • LUCENE-4718: Fixed documentation of oal.queryparser.classic. (Hayden Muhl via Adrien Grand)
  • LUCENE-4784, LUCENE-4785, LUCENE-4786: Fixed references to deprecated classes SinkTokenizer, ValueSourceQuery and RangeQuery. (Hao Zhong via Adrien Grand)
  • Build:
  • LUCENE-4654: Test duration statistics from multiple test runs should be reused. (Dawid Weiss)
  • LUCENE-4636: Upgrade ivy to 2.3.0 (Shawn Heisey via Robert Muir)
  • LUCENE-4570: Use the Policeman Forbidden API checker, released separately from Lucene and downloaded via Ivy. (Uwe Schindler, Robert Muir)
  • LUCENE-4758: 'ant jar', 'ant compile', and 'ant compile-test' should recurse. (Steve Rowe)

New in Apache Lucene 4.1.0 (Jan 23, 2013)

  • Changes in backwards compatibility policy:
  • LUCENE-4514: Scorer's freq() method returns an integer value indicating the number of times the scorer matches the current document. Previously this was only sometimes the case; in some cases it returned a (meaningless) floating point value. Scorer now extends DocsEnum so it has attributes().
  • LUCENE-4543: TFIDFSimilarity's index-time computeNorm is now final to match the fact that its query-time norm usage requires a FIXED_8 encoding. Override lengthNorm and/or encode/decodeNormValue to change the specifics, like Lucene 3.x.
  • LUCENE-3441: The facet module now supports NRT. As a result, the following changes were made:
  • DirectoryTaxonomyReader has a new constructor which takes a DirectoryTaxonomyWriter. You should use that constructor in order to get the NRT support (or the old one for non-NRT).
  • TaxonomyReader.refresh() removed in exchange for TaxonomyReader.openIfChanged static method. Similar to DirectoryReader, the method either returns null if no changes were made to the taxonomy, or a new TR instance otherwise. Instead of calling refresh(), you should write similar code to how you reopen a regular DirectoryReader.
  • TaxonomyReader.openIfChanged (previously refresh()) no longer throws InconsistentTaxonomyException, and supports recreate. InconsistentTaxoEx was removed.
  • ChildrenArrays was pulled out of TaxonomyReader into a top-level class.
  • TaxonomyReader was made an abstract class (instead of an interface), with methods such as close() and reference counting management pulled from DirectoryTaxonomyReader, and made final. The rest of the methods remained abstract.
  • LUCENE-4576: Remove CachingWrapperFilter(Filter, boolean). This recacheDeletes option gave less than 1% speedup at the expense of cache churn (filters were invalidated on reopen if even a single delete was posted against the segment).
  • LUCENE-4575: Replace IndexWriter's commit/prepareCommit versions that take commitData with setCommitData(). That allows committing changes to IndexWriter even if the commitData is the only thing that changes.
  • LUCENE-4565: TaxonomyReader.getParentArray and .getChildrenArrays consolidated into one getParallelTaxonomyArrays(). You can obtain the 3 arrays that the previous two methods returned by calling parents(), children() or siblings() on the returned ParallelTaxonomyArrays.
  • LUCENE-4585: Spatial PrefixTree based Strategies (either TermQuery or RecursivePrefix based) MAY want to re-index if used for point data. If a re-index is not done, then an indexed point is ~1/2 the smallest grid cell larger and as such is slightly more likely to match a query shape.
  • LUCENE-4604: DefaultOrdinalPolicy removed in favor of OrdinalPolicy.ALL_PARENTS. Same for DefaultPathPolicy (now PathPolicy.ALL_CATEGORIES). In addition, you can use OrdinalPolicy.NO_PARENTS to never write any parent category ordinal to the fulltree posting payload (but note that you need a special FacetsAccumulator - see javadocs).
  • LUCENE-4594: Spatial PrefixTreeStrategy no longer indexes center points of non-point shapes. If you want to call makeDistanceValueSource() based on shape centers, you need to do this yourself in another spatial field.
  • LUCENE-4615: Replace IntArrayAllocator and FloatArrayAllocator by ArraysPool. FacetArrays no longer takes those allocators; if you need to reuse the arrays, you should use ReusingFacetArrays.
  • LUCENE-4621: FacetIndexingParams is now a concrete class (instead of DefaultFIP). Also, the entire IndexingParams chain is now immutable. If you need to override a setting, you should extend the relevant class. Additionally, FacetSearchParams is now immutable, and requires all FacetRequests to be specified at initialization time.
  • LUCENE-4647: CategoryDocumentBuilder and EnhancementsDocumentBuilder are replaced by FacetFields and AssociationsFacetFields respectively. CategoryEnhancement and AssociationEnhancement were removed in favor of a simplified CategoryAssociation interface, with CategoryIntAssociation and CategoryFloatAssociation implementations. NOTE: indexes that contain category enhancements/associations are not supported by the new code and should be recreated.
  • LUCENE-4659: Massive cleanup to CategoryPath API. Additionally, CategoryPath is now immutable, so you don't need to clone() it.
  • LUCENE-4670: StoredFieldsWriter and TermVectorsWriter have new finish* callbacks which are called after a doc/field/term has been completely added.
  • LUCENE-4620: IntEncoder/Decoder were changed to do bulk encoding/decoding. As a result, a few other classes such as Aggregator and CategoryListIterator were changed to handle bulk category ordinals.
  • LUCENE-4683: CategoryListIterator and Aggregator are now per-segment. As such their implementations no longer take a top-level IndexReader in the constructor but rather implement a setNextReader.
  • New Features:
  • LUCENE-4226: New experimental StoredFieldsFormat that compresses chunks of documents together in order to improve the compression ratio.
  • LUCENE-4426: New ValueSource implementations (in lucene/queries) for DocValues fields.
  • LUCENE-4410: FilteredQuery now exposes a FilterStrategy that exposes how filters are applied during query execution.
  • LUCENE-4404: New ListOfOutputs (in lucene/misc) for FSTs wraps another Outputs implementation, allowing you to store more than one output for a single input. UpToTwoPositiveIntsOutputs was moved from lucene/core to lucene/misc.
  • LUCENE-3842: New AnalyzingSuggester, for doing auto-suggest using an analyzer. This can create powerful suggesters: if the analyzer removes stop words then "ghost chr..." could suggest "The Ghost of Christmas Past"; if SynonymFilter is used to map wifi and wireless network to hotspot, then "wirele..." could suggest "wifi router"; token normalization like stemmers, accent removal, etc. would allow the suggester to ignore such variations.
  • LUCENE-4446: Lucene 4.1 has a new default index format (Lucene41Codec) that incorporates the previously experimental "Block" postings format for better search performance.
  • LUCENE-3846: New FuzzySuggester, like AnalyzingSuggester except it also finds completions allowing for fuzzy edits in the input string.
  • LUCENE-4515: MemoryIndex now supports adding the same field multiple times.
  • LUCENE-4489: Added consumeAllTokens option to LimitTokenCountFilter
  • LUCENE-4566: Add NRT/SearcherManager.RefreshListener/addListener to be notified whenever a new searcher was opened.
  • SOLR-4123: Add per-script customizability to ICUTokenizerFactory via rule files in the ICU RuleBasedBreakIterator format.
  • LUCENE-4590: Added WriteEnwikiLineDocTask - a benchmark task for writing Wikipedia category pages and non-category pages into separate line files. extractWikipedia.alg was changed to use this task, so now it creates two files.
  • LUCENE-4290: Added PostingsHighlighter to the highlighter module. It uses offsets from the postings lists to highlight documents.
  • LUCENE-4628: Added CommonTermsQuery that executes high-frequency terms in an optional sub-query to prevent slow queries due to "common" terms like stopwords.
  • API Changes:
  • LUCENE-4399: Deprecated AppendingCodec. Lucene's term dictionaries no longer seek when writing.
  • LUCENE-4479: Rename TokenStream.getTokenStream(IndexReader, int, String) to TokenStream.getTokenStreamWithOffsets, and return null on failure rather than throwing IllegalArgumentException.
  • LUCENE-4472: MergePolicy now accepts a MergeTrigger that provides information about the trigger of the merge, i.e. whether the merge was triggered by a segment merge, a full flush, etc.
  • LUCENE-4415: TermsFilter is now immutable. All terms need to be provided as constructor arguments.
  • LUCENE-4520: ValueSource.getSortField no longer throws IOExceptions
  • LUCENE-4537: RateLimiter is now separated from FSDirectory and exposed via RateLimitingDirectoryWrapper. Any Directory can now be rate-limited.
  • LUCENE-4591: CompressingStoredFields{Writer,Reader} now accept a segment suffix as a constructor parameter.
  • LUCENE-4605: Added DocsEnum.FLAG_NONE which can be passed instead of 0 as the flag to .docs() and .docsAndPositions().
  • LUCENE-4617: Remove FST.pack() method. Previously to make a packed FST, you had to make a Builder with willPackFST=true (telling it you will later pack it), create your fst with finish(), and then call pack() to get another FST. Instead just pass true for doPackFST to Builder and finish() returns a packed FST.
  • LUCENE-4663: Deprecate IndexSearcher.document(int, Set). This was not intended to be final, nor named document(). Use IndexSearcher.doc(int, Set) instead.
  • LUCENE-4684: Made DirectSpellChecker extendable.
  • Bug Fixes:
  • LUCENE-1822: BaseFragListBuilder hard-coded 6 char margin is too naive.
  • LUCENE-4468: Fix rareish integer overflows in Lucene41 postings format.
  • LUCENE-4486: Add support for ConstantScoreQuery in Highlighter.
  • LUCENE-4485: When CheckIndex reports counts of terms, terms/docs pairs and tokens, these counts now all exclude deleted documents.
  • LUCENE-4479: Highlighter works correctly for fields with term vector positions, but no offsets.
  • SOLR-3906: JapaneseReadingFormFilter in romaji mode will return romaji even for out-of-vocabulary kana cases (e.g. half-width forms).
  • LUCENE-4504: Fix broken sort comparator in ValueSource.getSortField, used when sorting by a function query.
  • LUCENE-4511: TermsFilter might return wrong results if a field is not indexed or doesn't exist in the index.
  • LUCENE-4521: IndexWriter.tryDeleteDocument could return true (successfully deleting the document) but then on IndexWriter close/commit fail to write the new deletions, if no other changes happened in the IndexWriter instance.
  • LUCENE-4513: Fixed that deleted nested docs are scored into the parent doc when using ToParentBlockJoinQuery.
  • LUCENE-4534: Fixed WFSTCompletionLookup and Analyzing/FuzzySuggester to allow 0 byte values in the lookup keys.
  • LUCENE-4532: DirectoryTaxonomyWriter used a timestamp to denote taxonomy index re-creation, which could cause a bug in case machine clocks were not synced. Instead, it now tracks an 'epoch' version, which is incremented whenever the taxonomy is re-created, or replaced.
  • LUCENE-4544: Fixed off-by-1 in ConcurrentMergeScheduler that would allow 1+maxMergeCount merge threads to be created, instead of just maxMergeCount
  • LUCENE-4567: Fixed NullPointerException in analyzing, fuzzy, and WFST suggesters when no suggestions were added
  • LUCENE-4568: Fixed integer overflow in PagedBytes.PagedBytesData{In,Out}put.getPosition.
  • LUCENE-4581: GroupingSearch.setAllGroups(true) was failing to actually compute allMatchingGroups
  • LUCENE-4009: Improve TermsFilter.toString
  • LUCENE-4588: Benchmark's EnwikiContentSource was discarding last wiki document and had leaking threads in 'forever' mode.
  • LUCENE-4585: Spatial RecursivePrefixTreeFilter had some bugs that only occurred when shapes were indexed. In what appears to be rare circumstances, documents with shapes near a query shape were erroneously considered a match. In addition, it wasn't possible to index a shape representing the entire globe.
  • LUCENE-4595: EnwikiContentSource had a thread safety problem (NPE) in 'forever' mode
  • LUCENE-4587: fix WordBreakSpellChecker to not throw AIOOBE when presented with 2-char codepoints, and to correctly break/combine terms containing non-latin characters.
  • LUCENE-4596: fix a concurrency bug in DirectoryTaxonomyWriter.
  • LUCENE-4594: Spatial PrefixTreeStrategy would index center-points in addition to the shape to index if it was non-point, in the same field. But sometimes the center-point isn't actually in the shape (consider a LineString), and for highly precise shapes it could cause makeDistanceValueSource's cache to load parts of the shape's boundary erroneously too. So center points aren't indexed any more; you should use another spatial field.
  • LUCENE-4629: IndexWriter failed to delete documents if a document block was indexed and the Iterator threw an exception. Documents were only rolled back if the actual indexing process failed.
  • LUCENE-4608: Handle large number of requested fragments better.
  • LUCENE-4633: DirectoryTaxonomyWriter.replaceTaxonomy did not refresh its internal reader, which could cause an existing category to be added twice.
  • LUCENE-4461: If you added the same FacetRequest more than once, you would get inconsistent results.
  • LUCENE-4656: Fix regression in IndexWriter to work with empty TokenStreams that have no TermToBytesRefAttribute (commonly provided by CharTermAttribute), e.g., oal.analysis.miscellaneous.EmptyTokenStream.
  • LUCENE-4660: ConcurrentMergeScheduler was taking too long to un-pause incoming threads it had paused when too many merges were queued up.
  • LUCENE-4662: Add missing elided articles and prepositions to FrenchAnalyzer's DEFAULT_ARTICLES list passed to ElisionFilter.
  • LUCENE-4671: Fix CharsRef.subSequence method.
  • LUCENE-4465: Let ConstantScoreQuery's Scorer return its child scorer.
  • Changes in Runtime Behavior:
  • LUCENE-4586: Change default ResultMode of FacetRequest to PER_NODE_IN_TREE. This only affects requests with depth>1. If you execute such requests and rely on the facet results being returned flat (i.e. no hierarchy), you should set the ResultMode to GLOBAL_FLAT.
  • Optimizations:
  • LUCENE-2221: oal.util.BitUtil was modified to use Long.bitCount and Long.numberOfTrailingZeros (which are intrinsics since Java 6u18) instead of pure java bit twiddling routines in order to improve performance on modern JVMs/hardware.
  • LUCENE-4509: Enable stored fields compression by default in the Lucene 4.1 default codec.
  • LUCENE-4536: PackedInts on-disk format is now byte-aligned (it used to be long-aligned), saving up to 7 bytes per array of values.
  • LUCENE-4512: Additional memory savings for CompressingStoredFieldsFormat.
  • LUCENE-4443: Lucene41PostingsFormat no longer writes unnecessary offsets into the skipdata.
  • LUCENE-4459: Improve WeakIdentityMap.keyIterator() to remove GCed keys from backing map early instead of waiting for reap(). This makes test failures in TestWeakIdentityMap disappear, too.
  • LUCENE-4473: Lucene41PostingsFormat encodes offsets more efficiently for low frequency terms (< 128 occurrences).
  • LUCENE-4462: DocumentsWriter now flushes deletes, segment infos and builds CFS files if necessary during segment flush and not during publishing. The latter was a single threaded process while now all IO and CPU heavy computation is done concurrently in DocumentsWriterPerThread.
  • LUCENE-4496: Optimize Lucene41PostingsFormat when requesting a subset of the postings data (via flags to TermsEnum.docs/docsAndPositions) to use ForUtil.skipBlock.
  • LUCENE-4497: Don't write PosVIntCount to the positions file in Lucene41PostingsFormat, as it's always totalTermFreq % BLOCK_SIZE.
  • LUCENE-4498: In Lucene41PostingsFormat, when a term appears in only one document, instead of writing a file pointer to a VIntBlock containing the doc id, just write the doc id.
  • LUCENE-4515: MemoryIndex now uses Byte/IntBlockPool internally to hold terms and posting lists. All index data is represented as consecutive byte/int arrays to reduce GC cost and memory overhead.
  • LUCENE-4538: DocValues now caches direct sources in a ThreadLocal exposed via SourceCache. Users of this API can now simply obtain an instance via DocValues#getDirectSource per thread.
  • LUCENE-4580: DrillDown.query variants return a ConstantScoreQuery with boost set to 0.0f so that document scores are not affected by running a drill-down query.
  • LUCENE-4598: PayloadIterator no longer uses top-level IndexReader to iterate on the posting's payload.
  • LUCENE-4661: Drop default maxThreadCount to 1 and maxMergeCount to 2 in ConcurrentMergeScheduler, for faster merge performance on spinning-magnet drives
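As a rough illustration of the LUCENE-2221 change listed above, the sketch below compares hand-written bit routines with the JDK methods Long.bitCount and Long.numberOfTrailingZeros. The class BitCountSketch and its two helper methods are hypothetical stand-ins for "pure java bit twiddling routines" in general; they are not the code that was removed from oal.util.BitUtil.

```java
// Hypothetical illustration only; not the former BitUtil implementation.
final class BitCountSketch {

    /** Classic parallel bit-count written in plain Java. */
    static int popcountBitTwiddling(long x) {
        x = x - ((x >>> 1) & 0x5555555555555555L);
        x = (x & 0x3333333333333333L) + ((x >>> 2) & 0x3333333333333333L);
        x = (x + (x >>> 4)) & 0x0F0F0F0F0F0F0F0FL;
        return (int) ((x * 0x0101010101010101L) >>> 56);
    }

    /** Counts trailing zero bits with a simple loop; returns 64 for zero. */
    static int trailingZerosLoop(long x) {
        if (x == 0) {
            return 64;
        }
        int n = 0;
        while ((x & 1L) == 0) {
            x >>>= 1;
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, 0xF0L, -1L, Long.MIN_VALUE};
        for (long v : samples) {
            // The JDK methods return the same values and can be compiled to CPU instructions.
            System.out.println(Long.toHexString(v)
                + " bits=" + Long.bitCount(v) + "/" + popcountBitTwiddling(v)
                + " ntz=" + Long.numberOfTrailingZeros(v) + "/" + trailingZerosLoop(v));
        }
    }
}
```

Both variants return identical results; the motivation given in the entry is that the JDK methods are JVM intrinsics, so they can run faster on modern JVMs and hardware.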
  • Documentation:
  • LUCENE-4483: Refer to BytesRef.deepCopyOf in Term's constructor that takes BytesRef.
  • Build:
  • LUCENE-4650: Upgrade randomized testing to version 2.0.8: make the test framework more robust under low memory conditions.
  • LUCENE-4603: Upgrade randomized testing to version 2.0.5: print forked JVM PIDs on heartbeat from hung tests
  • Upgrade randomized testing to version 2.0.4: avoid hangs on shutdown hooks hanging forever by calling Runtime.halt() in addition to Runtime.exit() after a short delay to allow graceful shutdown
  • LUCENE-4451: Memory leak per unique thread caused by RandomizedContext.contexts static map. Upgrade randomized testing to version 2.0.2
  • LUCENE-4589: Upgraded benchmark module's Nekohtml dependency to version 1.9.17, removing the workaround in Lucene's HTML parser for the Turkish locale.
  • LUCENE-4601: Fix ivy availability check to use typefound, so it works if called from another build file.

New in Apache Lucene 4.0.0 (Oct 12, 2012)

  • Highlights:
  • A new "Block" PostingsFormat offering improved search performance and index compression. This will likely become the default format in a future release.
  • All non-default codec implementations were moved to a separate codecs module. Just add lucene-codecs-4.0.0.jar to your classpath to test these out.
  • Payloads can be optionally stored on the term vectors.
  • Many bug fixes and optimizations.
  • Changes in backwards compatibility policy:
  • LUCENE-4392: Class org.apache.lucene.util.SortedVIntList has been removed.
  • LUCENE-4393: RollingCharBuffer has been moved to the o.a.l.analysis.util package of lucene-analysis-common.
  • New Features:
  • LUCENE-1888: Added the option to store payloads in the term vectors (IndexableFieldType.storeTermVectorPayloads()). Note that you must store term vector positions to store payloads.
  • LUCENE-3892: Add a new BlockPostingsFormat that bulk-encodes docs, freqs and positions in large (size 128) packed-int blocks for faster search performance. This was from Han Jiang's 2012 Google Summer of Code project
  • LUCENE-4323: Added support for an absolute maximum CFS segment size (in MiB) to LogMergePolicy and TieredMergePolicy.
  • LUCENE-4339: Allow deletes against 3.x segments for easier upgrading. Lucene3x Codec is still otherwise read-only, you should not set it as the default Codec on IndexWriter, because it cannot write new segments.
  • SOLR-3441: ElisionFilterFactory is now MultiTermAware
  • API Changes:
  • LUCENE-4391, LUCENE-4440: All methods of Lucene40Codec but getPostingsFormatForField are now final. To reuse functionality of Lucene40, you should extend FilterCodec and delegate to Lucene40 instead of extending Lucene40Codec.
  • LUCENE-4299: Added Terms.hasPositions() and Terms.hasOffsets(). Previously you had no real way to know that a term vector field had positions or offsets, since this can be configured on a per-field-per-document basis.
  • Removed DocsAndPositionsEnum.hasPayload() and simplified the contract of getPayload(). It returns null if there is no payload, otherwise returns the current payload. You can now call it multiple times per position if you want.
  • Removed FieldsEnum. Fields API instead implements Iterable and exposes Iterator, so you can iterate over field names with for (String field : fields) instead.
  • LUCENE-4152: added IndexReader.leaves(), which lets you enumerate the leaf atomic reader contexts for all readers in the tree.
  • LUCENE-4304: removed PayloadProcessorProvider. If you want to change payloads (or other things) when merging indexes, it's recommended to just use a FilterAtomicReader + IndexWriter.addIndexes. See the OrdinalMappingAtomicReader and TaxonomyMergeUtils in the facets module if you want an example of this.
  • LUCENE-4304: Make CompositeReader.getSequentialSubReaders() protected. To get atomic leaves of any IndexReader use the new method leaves() (LUCENE-4152), which lists AtomicReaderContexts including the doc base of each leaf.
  • LUCENE-4307: Renamed IndexReader.getTopReaderContext to IndexReader.getContext.
  • LUCENE-4316: Deprecate Fields.getUniqueTermCount and remove it from AtomicReader. If you really want the unique term count across all fields, just sum up Terms.size() across those fields. This method only exists so that this statistic can be accessed for Lucene 3.x segments, which don't support Terms.size().
  • LUCENE-4321: Change CharFilter to extend Reader directly, as FilterReader overdelegates (read(), read(char[], int, int), skip, etc). This made it hard to implement CharFilters that were correct. Instead only close() is delegated by default: read(char[], int, int) and correct(int) are abstract so that it's obvious which methods you should implement. The protected inner Reader is 'input' like CharFilter in the 3.x series, instead of 'in'.
  • LUCENE-3309: The expert FieldSelector API, used to load only certain fields in a stored document, has been replaced with the simpler StoredFieldVisitor API.
  • LUCENE-4343: Made Tokenizer.setReader final. This is a setter that should not be overridden by subclasses: per-stream initialization should happen in reset().
  • LUCENE-4377: Remove IndexInput.copyBytes(IndexOutput, long). Use DataOutput.copyBytes(DataInput, long) instead.
  • LUCENE-4355: Simplify AtomicReader's sugar methods such as termDocsEnum, termPositionsEnum, docFreq, and totalTermFreq to only take Term as a parameter. If you want to do expert things such as pass a different Bits as liveDocs, then use the flex apis (fields(), terms(), etc) directly.
  • LUCENE-4425: clarify documentation of StoredFieldVisitor.binaryValue and simplify the api to binaryField(FieldInfo, byte[]).
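The design described in LUCENE-4321 above can be sketched with java.io alone. The classes below are hypothetical stand-ins, not Lucene's CharFilter: the names CharFilterSketch, SketchCharFilter and UpperCaseSketchFilter are assumptions for illustration. The base class extends Reader directly, delegates only close(), and leaves read(char[], int, int) and correct(int) abstract; because java.io.Reader builds its single-character read() on top of read(char[], int, int), a subclass that implements just that one method is applied on every read path.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not Lucene's CharFilter.
final class CharFilterSketch {

    /** Extends Reader directly; only close() is delegated to the wrapped reader. */
    abstract static class SketchCharFilter extends Reader {
        protected final Reader input;

        SketchCharFilter(Reader input) {
            this.input = input;
        }

        @Override
        public void close() throws IOException {
            input.close();
        }

        /** Maps an offset in the filtered text back to the original text. */
        protected abstract int correct(int currentOff);
        // read(char[], int, int) is inherited as abstract from Reader.
    }

    /** Upper-cases characters; offsets are unchanged, so correct() is the identity. */
    static final class UpperCaseSketchFilter extends SketchCharFilter {
        UpperCaseSketchFilter(Reader input) {
            super(input);
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            int n = input.read(cbuf, off, len);
            for (int i = off; i < off + n; i++) {
                cbuf[i] = Character.toUpperCase(cbuf[i]);
            }
            return n;
        }

        @Override
        protected int correct(int currentOff) {
            return currentOff;
        }
    }

    /** Reads via the single-char read(), which Reader routes through read(char[], int, int). */
    static String readOneCharAtATime(String text) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new UpperCaseSketchFilter(new StringReader(text))) {
            for (int c = r.read(); c != -1; c = r.read()) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(readOneCharAtATime("ghost chr"));
    }
}
```

With a FilterReader base class, the single-character read() would be forwarded straight to the wrapped reader, so a subclass overriding only the array-based read would be bypassed on that path.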
  • Bug Fixes:
  • LUCENE-4423: DocumentStoredFieldVisitor.binaryField ignored offset and length.
  • LUCENE-4297: BooleanScorer2 would multiply the coord() factor twice for conjunctions: for most users this is no problem, but if you had a customized Similarity that returned something other than 1 when overlap == maxOverlap (always the case for conjunctions), then the score would be incorrect.
  • LUCENE-4298: MultiFields.getTermDocsEnum(IndexReader, Bits, String, BytesRef) did not work at all: it would infinitely recurse.
  • LUCENE-4300: BooleanQuery's rewrite was not always safe: if you had a custom Similarity where coord(1,1) != 1F, then the rewritten query would be scored differently.
  • Don't allow negatives in the positions file. If you have an index from 2.4.0 or earlier with such negative positions, and you already upgraded to 3.x, then to Lucene 4.0-ALPHA or -BETA, you should run CheckIndex. If it fails, then you need to upgrade again to 4.0
  • LUCENE-4303: PhoneticFilterFactory and SnowballPorterFilterFactory load their encoders / stemmers via the ResourceLoader now instead of Class.forName(). Solr users should now no longer have to embed these in its war.
  • SOLR-3737: StempelPolishStemFilterFactory loaded its stemmer table incorrectly. Also, ensure immutability and use only one instance of this table in RAM (lazy loaded) since it's quite large.
  • LUCENE-4310: MappingCharFilter was failing to match input strings containing non-BMP Unicode characters.
  • LUCENE-4224: Add in-order scorer to query time joining and the out-of-order scorer throws an UOE.
  • LUCENE-4333: Fixed NPE in TermGroupFacetCollector when faceting on mv fields.
  • LUCENE-4218: Document.get(String) and Field.stringValue() again return values for numeric fields, like Lucene 3.x and consistent with the documentation.
  • NRTCachingDirectory was always caching a newly flushed segment in RAM, instead of checking the estimated size of the segment to decide whether to cache it.
  • LUCENE-3720: fix memory-consumption issues with BeiderMorseFilter.
  • LUCENE-4401: Fix bug where DisjunctionSumScorer would sometimes call score() on a subscorer that had already returned NO_MORE_DOCS.
  • LUCENE-4411: when sampling is enabled for a FacetRequest, its depth parameter is reset to the default (1), even if set otherwise.
  • LUCENE-4455: Fix bug in SegmentInfoPerCommit.sizeInBytes() that was returning 2X the true size, inefficiently. Also fixed bug in CheckIndex that would report no deletions when a segment has deletions, and vice versa.
  • LUCENE-4456: Fixed double-counting sizeInBytes for a segment (affects how merge policies pick merges); fixed CheckIndex's incorrect reporting of whether a segment has deletions; fixed case where on abort Lucene could remove files it didn't create; fixed many cases where IndexWriter could leave leftover files (on exception in various places, on reuse of a segment name after crash and recovery).
  • Optimizations:
  • LUCENE-4322: Decrease lucene-core JAR size. The core JAR size had increased a lot because of generated code introduced in LUCENE-4161 and LUCENE-3892.
  • LUCENE-4317: Improve reuse of internal TokenStreams and StringReader in oal.document.Field.
  • LUCENE-4327: Support out-of-order scoring in FilteredQuery for higher performance.
  • LUCENE-4364: Optimize MMapDirectory to not make a mapping per-cfs-slice, instead one map per .cfs file. This reduces the total number of maps. Additionally factor out a (package-private) generic ByteBufferIndexInput from MMapDirectory.
  • Build:
  • LUCENE-4406, LUCENE-4407: Upgrade to randomizedtesting 2.0.1. Workaround for broken test output XMLs due to non-XML text unicode chars in strings. Added printing of failed tests at the end of a test run
  • LUCENE-4252: Detect/Fail tests when they leak RAM in static fields
  • LUCENE-4360: Support running the same test suite multiple times in parallel
  • LUCENE-3985: Upgrade to randomizedtesting 2.0.0. Added support for thread leak detection. Added support for suite timeouts.
  • LUCENE-4340: Move all non-default codec, postings format and terms dictionary implementations to lucene/codecs.
  • Documentation:
  • LUCENE-4302: Fix facet userguide to have HTML loose doctype like all other javadocs.

New in Apache Lucene 4.0.0 Beta (Aug 14, 2012)

  • Highlights:
  • IndexWriter.tryDeleteDocument can sometimes delete by document ID, for higher performance in some applications.
  • New experimental postings formats: BloomFilteringPostingsFormat uses a bloom filter to sometimes avoid disk seeks when looking up terms, DirectPostingsFormat holds all postings as simple byte[] and int[] for very fast performance at the cost of very high RAM consumption.
  • CJK analysis improvements: JapaneseIterationMarkCharFilter normalizes Japanese iteration marks, added unigram+bigram support to CJKBigramFilter.
  • Improvements to Scorer navigation API (Scorer.getChildren) to support all queries, useful for determining which portions of the query matched.
  • Analysis improvements: factories for creating Tokenizer, TokenFilter, and CharFilter have been moved from Solr to Lucene's analysis module, less memory overhead for StandardTokenizer and Snowball filters.
  • Improved highlighting for multi-valued fields.
  • Various other API changes, optimizations and bug fixes.
  • New features:
  • LUCENE-4249: Changed the explanation of the PayloadTermWeight to use the underlying PayloadFunction's explanation as the explanation for the payload score. (Scott Smerchek via Robert Muir)
  • LUCENE-4069: Added BloomFilteringPostingsFormat for use with low-frequency terms such as primary keys (Mark Harwood, Mike McCandless)
  • LUCENE-4201: Added JapaneseIterationMarkCharFilter to normalize Japanese iteration marks. (Robert Muir, Christian Moen)
  • LUCENE-3832: Added BasicAutomata.makeStringUnion method to efficiently create automata from a fixed collection of UTF-8 encoded BytesRef (Dawid Weiss, Robert Muir)
  • LUCENE-4153: Added option to fast vector highlighting via BaseFragmentsBuilder to respect field boundaries in the case of highlighting for multivalued fields. (Martijn van Groningen)
  • LUCENE-4227: Added DirectPostingsFormat, to hold all postings in memory as uncompressed simple arrays. This uses a tremendous amount of RAM but gives good search performance gains. (Mike McCandless)
  • LUCENE-2510, LUCENE-4044: Migrated Solr's Tokenizer-, TokenFilter-, and CharFilterFactories to the lucene-analysis module. The API is still experimental. (Chris Male, Robert Muir, Uwe Schindler)
  • LUCENE-4230: When pulling a DocsAndPositionsEnum you can now specify whether or not you require payloads (in addition to offsets); turning one or both off may allow some codec implementations to optimize the enum implementation. (Robert Muir, Mike McCandless)
  • LUCENE-4203: Add IndexWriter.tryDeleteDocument(AtomicReader reader, int docID), to attempt deletion by docID as long as the provided reader is an NRT reader, and the segment has not yet been merged away (Mike McCandless).
  • LUCENE-4286: Added option to CJKBigramFilter to always also output unigrams. This can be used for a unigram+bigram approach, or at index-time only for better support of short queries. (Tom Burton-West, Robert Muir)
  • API Changes:
  • LUCENE-4138: update of morfologik (Polish morphological analyzer) to 1.5.3. The tag attribute class has been renamed to MorphosyntacticTagsAttribute and has a different API (carries a list of tags instead of a compound tag). Upgrade of embedded morfologik dictionaries to version 1.9. (Dawid Weiss)
  • LUCENE-4178: set 'tokenized' to true on FieldType by default, so that if you make a custom FieldType and set indexed = true, it's analyzed by the analyzer. (Robert Muir)
  • LUCENE-4220: Removed the buggy JavaCC-based HTML parser in the benchmark module and replaced it with NekoHTML. The HTMLParser interface was cleaned up while changing method signatures. (Uwe Schindler, Robert Muir)
  • LUCENE-2191: Rename Tokenizer.reset(Reader) to Tokenizer.setReader(Reader). The purpose of this method was always to set a new Reader on the Tokenizer, reusing the object. But the name was often confused with TokenStream.reset(). (Robert Muir)
  • LUCENE-4228: Refactored CharFilter to extend java.io.FilterReader. CharFilters filter another reader and you override correct() for offset correction. (Robert Muir)
  • LUCENE-4240: Analyzer api now just takes fieldName for getOffsetGap. If the field is not analyzed (e.g. StringField), then the analyzer is not invoked at all. If you want to tweak things like positionIncrementGap and offsetGap, analyze the field with KeywordTokenizer instead. (Grant Ingersoll, Robert Muir)
  • LUCENE-4250: Pass fieldName to the PayloadFunction explain method, so it parallels with docScore and the default implementation is correct. (Robert Muir)
  • LUCENE-3747: Support Unicode 6.1.0. (Steve Rowe)
  • LUCENE-3884: Moved ElisionFilter out of org.apache.lucene.analysis.fr package into org.apache.lucene.analysis.util. (Robert Muir)
  • LUCENE-4230: When pulling a DocsAndPositionsEnum you now pass an int flags instead of the previous boolean needOffsets. Currently recognized flags are DocsAndPositionsEnum.FLAG_PAYLOADS and DocsAndPositionsEnum.FLAG_OFFSETS (Robert Muir, Mike McCandless)
  • LUCENE-4273: When pulling a DocsEnum, you can pass an int flags instead of the previous boolean needsFlags; consistent with the changes for DocsAndPositionsEnum in LUCENE-4230. Currently the only flag is DocsEnum.FLAG_FREQS. (Robert Muir, Mike McCandless)
  • LUCENE-3616: TextField(String, Reader, Store) was reduced to TextField(String, Reader), as the Store parameter didn't make sense: if you supplied Store.YES, you would only receive an exception anyway. (Robert Muir)
  • Optimizations:
  • LUCENE-4171: Performance improvements to Packed64. (Toke Eskildsen via Adrien Grand)
  • LUCENE-4184: Performance improvements to the aligned packed bits impl. (Toke Eskildsen, Adrien Grand)
  • LUCENE-4235: Remove enforcing of Filter rewrite for NRQ queries. (Uwe Schindler)
  • LUCENE-4279: Regenerated snowball Stemmers from snowball r554, making them substantially more lightweight. Behavior is unchanged. (Robert Muir)
  • LUCENE-4291: Reduced internal buffer size for Jflex-based tokenizers such as StandardTokenizer from 32kb to 8kb. (Raintung Li, Steven Rowe, Robert Muir)
  • Bug Fixes:
  • LUCENE-4109: BooleanQueries are not parsed correctly with the flexible query parser. (Karsten Rauch via Robert Muir)
  • LUCENE-4176: Fix AnalyzingQueryParser to analyze range endpoints as bytes, so that it works correctly with Analyzers that produce binary non-UTF-8 terms such as CollationAnalyzer. (Nattapong Sirilappanich via Robert Muir)
  • LUCENE-4209: Fix FSTCompletionLookup to close its sorter, so that it won't leave temp files behind in /tmp. Fix SortedTermFreqIteratorWrapper to not leave temp files behind in /tmp on Windows. Fix Sort to not leave temp files behind when /tmp is a separate volume. (Uwe Schindler, Robert Muir)
  • LUCENE-4221: Fix overeager CheckIndex validation for term vector offsets. (Robert Muir)
  • LUCENE-4222: TieredMergePolicy.getFloorSegmentMB was returning the size in bytes not MB (Chris Fuller via Mike McCandless)
  • LUCENE-3505: Fix bug (Lucene 4.0alpha only) where boolean conjunctions were sometimes scored incorrectly. Conjunctions of only termqueries where at least one term omitted term frequencies (IndexOptions.DOCS_ONLY) would be scored as if all terms omitted term frequencies. (Robert Muir)
  • LUCENE-2686, LUCENE-3505: Fixed BooleanQuery scorers to return correct freq(). Added support for scorer navigation API (Scorer.getChildren) to all queries. Made Scorer.freq() abstract. (Koji Sekiguchi, Mike McCandless, Robert Muir)
  • LUCENE-4234: Exception when FacetsCollector is used with ScoreFacetRequest, and the number of matching documents is too large. (Gilad Barkai via Shai Erera)
  • LUCENE-4245: Make IndexWriter#close() and MergeScheduler#close() non-interruptible. (Mark Miller, Uwe Schindler)
  • LUCENE-4190: restrict allowed filenames that a codec may create to the patterns recognized by IndexFileNames. This also fixes IndexWriter to only delete files matching this pattern from an index directory, to reduce risk when the wrong index path is accidentally passed to IndexWriter (Robert Muir, Mike McCandless)
  • LUCENE-4277: Fix IndexWriter deadlock during rollback if flushable DWPT instances are already checked out and queued up but not yet flushed. (Simon Willnauer)
  • LUCENE-4282: Automaton FuzzyQuery didn't always deliver all results. (Johannes Christen, Uwe Schindler, Robert Muir)
  • LUCENE-4289: Fix minor idf inconsistencies/inefficiencies in highlighter. (Robert Muir)
  • Changes in Runtime Behavior:
  • LUCENE-4109: Enable position increments in the flexible queryparser by default. (Karsten Rauch via Robert Muir)
  • LUCENE-3616: Field throws exception if you try to set a boost on an unindexed field or one that omits norms. (Robert Muir)
  • Build:
  • LUCENE-4094: Support overriding file.encoding on forked test JVMs (force via -Drandomized.file.encoding=XXX). (Dawid Weiss)
  • LUCENE-4189: Test output should include timestamps (start/end for each test/suite). Added -Dtests.timestamps=[off by default]. (Dawid Weiss)
  • LUCENE-4110: Report long periods of forked JVM inactivity (hung tests/suites). Added -Dtests.heartbeat=[seconds] with the default of 60 seconds. (Dawid Weiss)
  • LUCENE-4160: Added a property to quit the tests after a given number of failures has occurred. This is useful in combination with -Dtests.iters=N (you can start N iterations and wait for M failures, in particular M = 1). -Dtests.maxfailures=M. Alternatively, specify -Dtests.failfast=true to skip all tests after the first failure. (Dawid Weiss)
  • LUCENE-4115: JAR resolution/cleanup should be done automatically for ant clean/eclipse/resolve (Dawid Weiss)
  • LUCENE-4199, LUCENE-4202, LUCENE-4206: Add a new target "check-forbidden-apis" that parses all generated .class files for use of APIs that use default charset, default locale, or default timezone and fails the build if violations are found. This ensures that Lucene/Solr is independent of local configuration options. (Uwe Schindler, Robert Muir, Dawid Weiss)
  • LUCENE-4217: Add the possibility to run tests with Atlassian Clover loaded from IVY. A development License solely for Apache code was added in the tools/ folder, but is not included in releases. (Uwe Schindler)
  • Documentation:
  • LUCENE-4195: Added package documentation and examples for org.apache.lucene.codecs (Alan Woodward via Robert Muir)
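
The LUCENE-4230 and LUCENE-4273 entries above replace boolean parameters with an int flags bitmask. The following is a minimal, self-contained sketch of that pattern only. The class EnumFlagsSketch, its constant values, and its helper methods are hypothetical stand-ins for illustration and are not Lucene's actual declarations.

```java
// Sketch only: shows an int bitmask replacing separate boolean parameters.
// The constants and methods here are hypothetical, not copied from Lucene.
final class EnumFlagsSketch {
    static final int FLAG_OFFSETS = 0x1;   // caller needs offsets
    static final int FLAG_PAYLOADS = 0x2;  // caller needs payloads

    // True if the caller asked for offsets.
    static boolean wantsOffsets(int flags) {
        return (flags & FLAG_OFFSETS) != 0;
    }

    // True if the caller asked for payloads.
    static boolean wantsPayloads(int flags) {
        return (flags & FLAG_PAYLOADS) != 0;
    }

    // Describes what a hypothetical enum implementation would have to decode.
    static String describe(int flags) {
        StringBuilder sb = new StringBuilder("positions");
        if (wantsOffsets(flags)) sb.append("+offsets");
        if (wantsPayloads(flags)) sb.append("+payloads");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(0));
        System.out.println(describe(FLAG_OFFSETS | FLAG_PAYLOADS));
    }
}
```

A bitmask lets new options be added later without changing the method signature again, and an implementation can skip decoding anything whose bit is not set.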

New in Apache Lucene 3.6.1 (Jul 23, 2012)

  • The concurrency of MMapIndexInput.clone() was improved; the previous implementation caused a performance regression in comparison to Lucene 3.5.0.
  • MappingCharFilter was fixed to return correct final token positions.
  • QueryParser now supports +/- operators with any amount of whitespace.
  • DisjunctionMaxScorer now implements visitSubScorers().
  • Changed the visibility of Scorer#visitSubScorers() to public, otherwise it's impossible to implement Scorers outside the Lucene package. This is a small backwards break, affecting a few users who implemented custom Scorers.
  • Various analyzer bugs were fixed: Kuromoji to not produce invalid token graph due to UNK with punctuation being decompounded, invalid position length in SynonymFilter, loading of Hunspell dictionaries that use aliasing, be consistent with closing streams when loading Hunspell affix files.
  • Various bugs in FST components were fixed: Offline sorter minimum buffer size, integer overflow in sorter, FSTCompletionLookup failed to close its sorter.
  • Fixed a synchronization bug in handling taxonomies in facet module.
  • Various minor bugs were fixed: BytesRef/CharsRef copy methods with nonzero offsets and subSequence off-by-one, TieredMergePolicy returned wrong-scaled floor segment setting.

New in Apache Lucene 4.0.0 Alpha (Jul 4, 2012)

  • The index formats for terms, postings lists, stored fields, term vectors, etc are pluggable via the Codec api. You can select from the provided implementations or customize the index format with your own Codec to meet your needs.
  • Similarity has been decoupled from the vector space model (TF/IDF). Additional models such as BM25, Divergence from Randomness, Language Models, and Information-based models are provided
  • Added support for per-document values (DocValues). DocValues can be used for custom scoring factors (accessible via Similarity), for pre-sorted Sort values, and more.
  • When indexing via multiple threads, each IndexWriter thread now flushes its own segment to disk concurrently, resulting in substantial performance improvements
  • Per-document normalization factors ("norms") are no longer limited to a single byte. Similarity implementations can use any DocValues type to store norms.
  • Added index statistics such as the number of tokens for a term or field, number of postings for a field, and number of documents with a posting for a field: these support additional scoring models
  • Implemented a new default term dictionary/index (BlockTree) that indexes shared prefixes instead of every n'th term. This is not only more time- and space-efficient, but can also sometimes avoid going to disk at all for terms that do not exist. Alternative term dictionary implementations are provided and pluggable via the Codec api.
  • Indexed terms are no longer UTF-16 char sequences, instead terms can be any binary value encoded as byte arrays. By default, text terms are now encoded as UTF-8 bytes. Sort order of terms is now defined by their binary value, which is identical to UTF-8 sort order.
  • Substantially faster performance when using a Filter during searching.
  • File-system based directories can rate-limit the IO (MB/sec) of merge threads, to reduce IO contention between merging and searching threads.
  • Added a number of alternative Codecs and components for different use-cases: "Appending" works with append-only filesystems (such as Hadoop DFS), "Memory" writes the entire terms+postings as an FST read into RAM (see http://blog.mikemccandless.com/2011/06/primary-key-lookups-are-28x-faster-with.html), "Pulsing" inlines the postings for low-frequency terms into the term dictionary (see http://blog.mikemccandless.com/2010/06/lucenes-pulsingcodec-on-primary-key.html), "SimpleText" writes all files in plain-text for easy debugging/transparency
  • Term offsets can be optionally encoded into the postings lists and can be retrieved per-position.
  • A new AutomatonQuery returns all documents containing any term matching a provided finite-state automaton
  • FuzzyQuery is 100-200 times faster than in past releases
  • A new spell checker, DirectSpellChecker, finds possible corrections directly against the main search index without requiring a separate index.
  • Various in-memory data structures such as the term dictionary and FieldCache are represented more efficiently with less object overhead
  • All search logic is now required to work per segment, IndexReader was therefore refactored to differentiate between atomic and composite readers
  • Lucene 4.0 provides a modular API, consolidating components such as Analyzers and Queries that were previously scattered across Lucene core, contrib, and Solr. These modules also include additional functionality such as UIMA analyzer integration and a completely reworked spatial search implementation.
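
The entry above on binary term order has a visible consequence: UTF-16 code unit order and UTF-8 byte order disagree for characters outside the Basic Multilingual Plane. The sketch below shows the difference using only the JDK (Java 9 or later for Arrays.compareUnsigned). The class TermOrderSketch and its methods are illustrative and are not part of Lucene.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch only: contrasts UTF-16 code unit order with UTF-8 byte order.
final class TermOrderSketch {
    // String.compareTo compares UTF-16 code units.
    static int compareUtf16(String a, String b) {
        return a.compareTo(b);
    }

    // Compares the UTF-8 encodings as unsigned bytes (binary order).
    static int compareUtf8Bytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        return Arrays.compareUnsigned(x, y);
    }

    public static void main(String[] args) {
        String bmp = "\uFB01";                  // U+FB01, a single UTF-16 code unit
        String supplementary = "\uD83D\uDE00";  // U+1F600, a surrogate pair
        // In UTF-16 order the surrogate pair sorts first; in UTF-8 order it sorts last.
        System.out.println(compareUtf16(bmp, supplementary) > 0);
        System.out.println(compareUtf8Bytes(bmp, supplementary) < 0);
    }
}
```

Code that relied on terms enumerating in String.compareTo order may therefore see a different order for such terms.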

New in Apache Lucene 3.6.0 (Apr 13, 2012)

  • In addition to Java 5 and Java 6, this release now has full Java 7 support (minimum JDK 7u1 required).
  • TypeTokenFilter filters tokens based on their TypeAttribute.
  • Fixed offset bugs in a number of CharFilters, Tokenizers and TokenFilters that could lead to exceptions during highlighting.
  • Added phonetic encoders: Metaphone, Soundex, Caverphone, Beider-Morse, etc.
  • CJKBigramFilter and CJKWidthFilter replace CJKTokenizer.
  • Kuromoji morphological analyzer tokenizes Japanese text, producing both compound words and their segmentation.
  • Static index pruning (Carmel pruning) removes postings with low within-document term frequency.
  • QueryParser now interprets '*' as an open end for range queries.
  • FieldValueFilter excludes documents missing the specified field.
  • CheckIndex and IndexUpgrader allow you to specify the specific FSDirectory implementation to use with the new -dir-impl command-line option.
  • FSTs can now do reverse lookup (by output) in certain cases and can be packed to reduce their size. There is now a method to retrieve top N shortest paths from a start node in an FST.
  • New WFSTCompletionLookup suggester supports finer-grained ranking for suggestions.
  • FST based suggesters now use an offline (disk-based) sort, instead of in-memory sort, when pre-sorting the suggestions.
  • ToChildBlockJoinQuery joins in the opposite direction (parent down to child documents).
  • New query-time joining is more flexible (but less performant) than index-time joins.
  • Added HTMLStripCharFilter to strip HTML markup.
  • Security fix: Better prevention of virtual machine SIGSEGVs when using MMapDirectory: Code using cloned IndexInputs of already closed indexes could possibly crash VM, allowing DoS attacks to your application.
  • Many bug fixes...

New in Apache Lucene 3.5.0 (Nov 28, 2011)

  • Added a very substantial (3-5X) RAM reduction required to hold the terms index on opening an IndexReader. (LUCENE-2205)
  • Added IndexSearcher.searchAfter which returns results after a specified ScoreDoc (e.g. last document on the previous page) to support deep paging use cases. (LUCENE-2215)
  • Added SearcherManager to manage sharing and reopening IndexSearchers across multiple search threads. Underlying IndexReader instances are safely closed if not referenced anymore. (LUCENE-3445, LUCENE-3558)
  • Added SearcherLifetimeManager which safely provides a consistent view of the index across multiple requests (e.g. paging/drilldown). (LUCENE-3558, LUCENE-3486)
  • Renamed IndexWriter.optimize to forceMerge to discourage use of this method since it is horribly costly and rarely justified anymore. (LUCENE-3454)
  • Added NGramPhraseQuery that speeds up phrase queries 30-50% when n-gram analysis is used. (LUCENE-3426)
  • Added a new reopen API (IndexReader.openIfChanged) that returns null instead of the old reader if there are no changes in the index. (LUCENE-3464)
  • Improvements to vector highlighting: support for more queries such as wildcards and boundary analysis for generated snippets. (LUCENE-1824, LUCENE-1889)
  • IndexSearcher and IndexReader now perform additional checks to throw AlreadyClosedExceptions if searches are performed on a closed IndexReader. Performing searches on an already closed reader can cause JVM crashes when invalid memory mapped files are referenced.
  • Several bugfixes, including a bug where closing an NRT reader after the writer was closed was incorrectly invoking the DeletionPolicy. See CHANGES.txt entries for full details.
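
The idea behind IndexSearcher.searchAfter (LUCENE-2215) can be sketched without Lucene: rather than skipping a number of hits, the next page is defined as the hits that rank after the last hit of the previous page. The class SearchAfterSketch, its Hit type, and the tie-break on document id are assumptions made for illustration. This is not Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: "search after" paging over an already ranked in-memory list.
final class SearchAfterSketch {
    // Hypothetical stand-in for a scored hit; not Lucene's ScoreDoc class.
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // True if h ranks strictly after 'after': lower score, or equal score and higher doc id.
    static boolean ranksAfter(Hit h, Hit after) {
        return h.score < after.score || (h.score == after.score && h.doc > after.doc);
    }

    // Returns up to n hits ranking after 'after'. 'ranked' must be sorted by
    // descending score, then ascending doc id. A null 'after' yields the first page.
    static List<Hit> searchAfter(List<Hit> ranked, Hit after, int n) {
        List<Hit> page = new ArrayList<>();
        for (Hit h : ranked) {
            if (page.size() == n) break;
            if (after == null || ranksAfter(h, after)) page.add(h);
        }
        return page;
    }

    public static void main(String[] args) {
        List<Hit> ranked = Arrays.asList(
            new Hit(3, 2.0f), new Hit(1, 1.5f), new Hit(4, 1.5f), new Hit(2, 0.5f));
        List<Hit> first = searchAfter(ranked, null, 2);
        List<Hit> second = searchAfter(ranked, first.get(first.size() - 1), 2);
        for (Hit h : second) System.out.println(h.doc + " " + h.score);
    }
}
```

Because each page is keyed on the last hit seen, the caller only needs to keep one hit between requests instead of re-collecting all earlier pages.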

New in Apache Lucene 3.4.0 (Sep 15, 2011)

  • Fixed a major bug (LUCENE-3418) whereby a Lucene index could easily become corrupted if the OS or computer crashed or lost power.
  • Added a new faceting module (contrib/facet) for computing facet counts (both hierarchical and non-hierarchical) at search time (LUCENE-3079).
  • Added a new join module (contrib/join), enabling indexing and searching of nested (parent/child) documents using BlockJoinQuery/Collector (LUCENE-3171).
  • It is now possible to index documents with term frequencies included but without positions (LUCENE-2048); previously omitTermFreqAndPositions always omitted both.
  • The modular QueryParser (contrib/queryparser) can now create NumericRangeQuery.
  • Added SynonymFilter, in contrib/analyzers, to apply multi-word synonyms during indexing or querying, including parsers to read the wordnet and solr synonym formats (LUCENE-3233).
  • You can now control how documents that don't have a value on the sort field should sort (LUCENE-3390), using SortField.setMissingValue.
  • Fixed a case where term vectors could be silently deleted from the index after addIndexes (LUCENE-3402).
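
The effect of a configurable missing value on sorting (LUCENE-3390) can be illustrated with plain Java: documents without a value are treated as if they had the substitute value, which decides whether they sort first or last. The class MissingValueSortSketch, its Doc type, and the price field are hypothetical. The sketch does not use or reproduce SortField.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch only: substitutes a configured value for documents missing the sort field.
final class MissingValueSortSketch {
    // Hypothetical document; price is null when the field is missing.
    static final class Doc {
        final String id;
        final Integer price;
        Doc(String id, Integer price) { this.id = id; this.price = price; }
    }

    // Returns ids sorted ascending by price, using missingValue for docs without a price.
    static List<String> sortedIds(List<Doc> docs, int missingValue) {
        List<Doc> copy = new ArrayList<>(docs);
        copy.sort(Comparator.comparingInt((Doc d) -> d.price == null ? missingValue : d.price));
        List<String> ids = new ArrayList<>();
        for (Doc d : copy) ids.add(d.id);
        return ids;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(new Doc("a", 5), new Doc("b", null), new Doc("c", 1));
        System.out.println(sortedIds(docs, Integer.MAX_VALUE)); // missing sorts last
        System.out.println(sortedIds(docs, Integer.MIN_VALUE)); // missing sorts first
    }
}
```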

New in Apache Lucene 3.3.0 (Jul 2, 2011)

  • Changes in backwards compatibility policy:
  • LUCENE-3140: IndexOutput.copyBytes now takes a DataInput (superclass of IndexInput) as its first argument. (Robert Muir, Dawid Weiss, Mike McCandless)
  • LUCENE-3191: FieldComparator.value now returns an Object not Comparable; FieldDoc.fields also changed from Comparable[] to Object[] (Uwe Schindler, Mike McCandless)
  • LUCENE-3208: Made deprecated methods Query.weight(Searcher) and Searcher.createWeight() final to prevent override. If you have overridden one of these methods, cut over to the non-deprecated implementation. (Uwe Schindler, Robert Muir, Yonik Seeley)
  • LUCENE-3238: Made MultiTermQuery.rewrite() final, to prevent problems (such as not properly setting rewrite methods, or not working correctly with things like SpanMultiTermQueryWrapper). To rewrite to a simpler form, instead return a simpler enum from getEnum(IndexReader). For example, to rewrite to a single term, return a SingleTermEnum. (ludovic Boutros, Uwe Schindler, Robert Muir)
  • Changes in runtime behavior:
  • LUCENE-2834: the hash used to compute the lock file name when the lock file is not stored in the index has changed. This means you will see a different lucene-XXX-write.lock in your lock directory. (Robert Muir, Uwe Schindler, Mike McCandless)
  • LUCENE-3146: IndexReader.setNorm throws IllegalStateException if the field does not store norms. (Shai Erera, Mike McCandless)
  • LUCENE-3198: On Linux, if the JRE is 64 bit and supports unmapping, FSDirectory.open now defaults to MMapDirectory instead of NIOFSDirectory since MMapDirectory gives better performance. (Mike McCandless)
  • LUCENE-3200: MMapDirectory now uses chunk sizes that are powers of 2. When setting the chunk size, it is rounded down to the next possible value. The new default value for 64 bit platforms is 2^30 (1 GiB), for 32 bit platforms it stays unchanged at 2^28 (256 MiB). Internally, MMapDirectory now only uses one dedicated final IndexInput implementation supporting multiple chunks, which makes Hotspot's life easier. (Uwe Schindler, Robert Muir, Mike McCandless)
  • Bug fixes:
  • LUCENE-3147,LUCENE-3152: Fixed open file handles leaks in many places in the code. Now MockDirectoryWrapper (in test-framework) tracks all open files, including locks, and fails if the test fails to release all of them. (Mike McCandless, Robert Muir, Shai Erera, Simon Willnauer)
  • LUCENE-3102: CachingCollector.replay was failing to call setScorer per-segment (Martijn van Groningen via Mike McCandless)
  • LUCENE-3183: Fix rare corner case where seeking to empty term (field="", term="") with terms index interval 1 could hit ArrayIndexOutOfBoundsException (selckin, Robert Muir, Mike McCandless)
  • LUCENE-3208: IndexSearcher had its own private similarity field and corresponding get/setter overriding Searcher's implementation. If you set a different Similarity instance on IndexSearcher, methods implemented in the superclass Searcher were not using it, leading to strange bugs. (Uwe Schindler, Robert Muir)
  • LUCENE-3197: Fix core merge policies to not over-merge during background optimize when documents are still being deleted concurrently with the optimize (Mike McCandless)
  • LUCENE-3222: The RAM accounting for buffered delete terms was failing to measure the space required to hold the term's field and text character data. (Mike McCandless)
  • LUCENE-3238: Fixed bug where using WildcardQuery("prefix*") inside of a SpanMultiTermQueryWrapper rewrote incorrectly and returned an error instead. (ludovic Boutros, Uwe Schindler, Robert Muir)
  • API Changes:
  • LUCENE-3208: Renamed protected IndexSearcher.createWeight() to expert public method IndexSearcher.createNormalizedWeight() as this better describes what this method does. The old method is still there for backwards compatibility. Query.weight() was deprecated and simply delegates to IndexSearcher. Both deprecated methods will be removed in Lucene 4.0. (Uwe Schindler, Robert Muir, Yonik Seeley)
  • LUCENE-3197: MergePolicy.findMergesForOptimize now takes Map instead of Set as the second argument, so the merge policy knows which segments were originally present vs produced by an optimizing merge (Mike McCandless)
  • Optimizations:
  • LUCENE-1736: DateTools.java general improvements. (David Smiley via Steve Rowe)
  • New Features:
  • LUCENE-3140: Added experimental FST implementation to Lucene. (Robert Muir, Dawid Weiss, Mike McCandless)
  • LUCENE-3193: A new TwoPhaseCommitTool allows running a 2-phase commit algorithm over objects that implement the new TwoPhaseCommit interface (such as IndexWriter). (Shai Erera)
  • LUCENE-3191: Added TopDocs.merge, to facilitate merging results from different shards (Uwe Schindler, Mike McCandless)
  • LUCENE-3179: Added OpenBitSet.prevSetBit (Paul Elschot via Mike McCandless)
  • LUCENE-3210: Made TieredMergePolicy more aggressive in reclaiming segments with deletions; added new methods set/getReclaimDeletesWeight to control this. (Mike McCandless)
  • Build:
  • LUCENE-1344: Create OSGi bundle using dev-tools/maven. (Nicolas Lalevée, Luca Stancapiano via ryan)
  • LUCENE-3204: The maven-ant-tasks jar is now included in the source tree; users of the generate-maven-artifacts target no longer have to manually place this jar in the Ant classpath. NOTE: when Ant looks for the maven-ant-tasks jar, it looks first in its pre-existing classpath, so any copies it finds will be used instead of the copy included in the Lucene/Solr source tree. For this reason, it is recommended to remove any copies of the maven-ant-tasks jar in the Ant classpath, e.g. under ~/.ant/lib/ or under the Ant installation's lib/ directory. (Steve Rowe)
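
The rounding described in LUCENE-3200 above (a requested chunk size is rounded down to a power of 2) can be sketched with a single JDK call. The class ChunkSizeSketch and its method name are hypothetical. This illustrates the arithmetic only and is not MMapDirectory's actual code.

```java
// Sketch only: rounds a requested size down to the nearest power of two.
final class ChunkSizeSketch {
    // Returns the largest power of two that is <= requested; rejects non-positive sizes.
    static int roundDownToPowerOfTwo(int requested) {
        if (requested <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        return Integer.highestOneBit(requested);
    }

    public static void main(String[] args) {
        System.out.println(roundDownToPowerOfTwo(1000));    // 512
        System.out.println(roundDownToPowerOfTwo(1 << 30)); // already a power of two
    }
}
```

With power-of-two chunks, a file position can be split into a chunk index and an offset within the chunk using a shift and a mask instead of division.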