Norconex Importer Changelog

What's new in Norconex Importer 2.6.0

Oct 26, 2016
  • New CountMatchesTagger that counts occurrences of a matching substring or regular expression in a field value or document content and stores the count in a target field.
  • DateFormatTagger now accepts multiple source formats when attempting to convert dates, trying them in the order provided.
  • DOMTagger can now apply DOM selection on an optional "fromField" and can also use a "defaultValue" when there is no match.
  • New DOM selector possibility for DOMContentFilter and DOMTagger: ownText, data, id, tagName, val, className, cssSelector, and attr(attributeKey).
  • TranslatorSplitter now supports Yandex translation service.
  • GenericDocumentParserFactory/AbstractTikaParser now allows you to control which embedded documents you do not want extracted from their containers.
  • GenericDocumentParserFactory/AbstractTikaParser now allows you to control which document containers should not have their embedded documents extracted.
  • GenericDocumentParserFactory/AbstractTikaParser now allows you to specify, via regular expression, which content types should have their embedded documents "split".
  • GenericDocumentParserFactory now allows you to define and configure parsers via XML.
  • New IHintsAwareParser interface for parsers that can benefit from global configuration settings.
  • New ParseHints class holding generic configuration settings to be set on parsers implementing the new IHintsAwareParser.
  • New EmbeddedConfig class holding configuration settings related to embedded documents. Used by ParseHints on GenericDocumentParserFactory.
  • Can now pass an optional -e or --contentEncoding argument on the command line to explicitly set the character encoding (charset).
  • LanguageTagger now uses Tika language detection (supports at least 70 languages).
  • GenericDocumentParserFactory has been modified to introduce the concept of ParseHints, which holds configuration settings that every parser has the option to support or not. Generic embedded and OCR configuration settings have been moved to the new ParseHints class.
  • The following GenericDocumentParserFactory methods are now deprecated: setSplitEmbedded(boolean), isSplitEmbedded(), setOCRConfig(OCRConfig), and getOCRConfig().
  • It is now possible to configure ExternalParser via XML.
  • Now validates configuration and variable file paths when launched on the command line (throws errors on invalid paths).
  • Dependency updates: Tika 1.13 (which now uses PDFBox 2.x), Norconex Commons Lang 1.9.1, JSoup 1.9.2.
  • OCRConfig#setContentTypes(String) and the equivalent configuration option in GenericDocumentParserFactory now expect a regular expression as opposed to a comma-separated list of content types.
  • DebugTagger now assumes UTF-8 instead of OS default charset when printing content.
  • Subclasses of AbstractStringTagger will now see tagTextDocument(...) method invoked at least once even if there is no content supplied.
  • Fixed DOMTagger ignoring subsequent selectors when one selector has no match.
  • Fixed ContentTypeDetector not closing TikaInputStream properly resulting in temporary "apache-tika-XXX.tmp" files not being deleted properly.
  • Fixed infinite loop with DOMSplitter when some selectors are too generic.
  • AbstractCharStreamTagger now tolerates null content stream.
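
  The multiple-source-format behavior described above for DateFormatTagger can be pictured with the following minimal sketch. It does not use the Importer API: the class name MultiFormatDateSketch and its convert method are hypothetical, and java.time is used here only for brevity, which is an assumption and not a statement about how the tagger is implemented.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical illustration only; not part of Norconex Importer.
final class MultiFormatDateSketch {

    /**
     * Tries each source format in the order provided and returns the value
     * re-formatted with the target format, or empty if no format matches.
     */
    static Optional<String> convert(
            String value, List<String> fromFormats, String toFormat) {
        DateTimeFormatter target =
                DateTimeFormatter.ofPattern(toFormat, Locale.ENGLISH);
        for (String fromFormat : fromFormats) {
            try {
                LocalDate date = LocalDate.parse(value,
                        DateTimeFormatter.ofPattern(fromFormat, Locale.ENGLISH));
                return Optional.of(date.format(target));
            } catch (DateTimeParseException e) {
                // This source format did not match; try the next one.
            }
        }
        return Optional.empty();
    }
}
```

  For instance, calling convert with the value "Oct 26, 2016" and the source formats "yyyy-MM-dd" then "MMM d, yyyy" fails on the first format and succeeds on the second.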

New in Norconex Importer 2.3.1 (Aug 11, 2015)

  • Maven dependency updates: Norconex Commons Lang 1.7.0.

New in Norconex Importer 2.3.0 (Aug 11, 2015)

  • New TextPatternTagger for extracting text matching regular expressions out of a document content and storing matches into a field.
  • New unit tests created for it.
  • Jar manifest now includes implementation entries and specification entries (matching Maven pom.xml).
  • Javadoc fixes and updates.
  • Library updates: Norconex Commons Lang 1.6.2.
  • Fixed NullPointerException in DebugTagger when a field contains a null value.
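
  As a rough illustration of what TextPatternTagger is described as doing (extracting regular-expression matches from content and storing them in a field), here is a small standalone sketch. The class TextPatternSketch, its methods, and the use of a plain Map as the metadata store are all assumptions made for the example; the actual tagger is configured and invoked through the Importer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Norconex Importer.
final class TextPatternSketch {

    /** Returns every substring matched by the regex, in order of appearance. */
    static List<String> extract(String content, String regex) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(content);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    /** Appends all matches to the target field of a simple metadata map. */
    static void tag(Map<String, List<String>> metadata,
            String content, String regex, String targetField) {
        metadata.computeIfAbsent(targetField, key -> new ArrayList<>())
                .addAll(extract(content, regex));
    }
}
```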

New in Norconex Importer 2.2.0 (Jun 16, 2015)

  • New DocumentLengthTagger for adding the document byte length as a field to imported documents.
  • New CurrentDateTagger for adding the current date as a field to imported documents.
  • New NumericMetadataFilter for filtering documents based on whether a numeric field value matches a given numeric range.
  • New DateMetadataFilter for filtering documents based on whether a date field value matches a given date range.
  • New ExternalParser class which is used to run an external process for parsing files (e.g. pdftotext) of the associated content type.
  • By default PDF parsing is now done with this flag set to true: "suppressDuplicateOverlappingText". This should eliminate the extraction of duplicate text in PDFs where bolding is done by having multiple instances of the same string on top of each other.
  • Complete rewrite of AbstractStringFilter, AbstractStringTagger, and AbstractStringTransformer to limit the memory taken for loading the content. Now the memory is specified in absolute terms instead of dynamically allocating it based on free memory (an approach that could cause OutOfMemory errors). All subclasses now accept a "maxReadSize" configuration option to set the maximum number of characters to process at once.
  • The abstract methods accepting a "partial" boolean argument on AbstractStringFilter, AbstractStringTagger, and AbstractStringTransformer have been changed to now accept a "sectionIndex" integer, representing the document content section being processed. Only larger documents will be processed one section of text at a time (to preserve memory).
  • AbstractCharStreamTransformer#transformTextDocument(...) now throws an ImporterHandlerException instead of IOException to be consistent with other handlers.
  • TitleGeneratorTagger was rewritten and no longer uses Carrot2, to reduce library dependencies.
  • Removed custom Tika mappings for Microsoft Visio now that they have been added to default Tika mappings in Tika 1.8. Reference: https://issues.apache.org/jira/browse/TIKA-1286
  • ReplaceTagger: now case insensitive by default. Added a new flag to turn case-sensitivity on/off. #addReplacement(...) methods have been deprecated in favor of addReplacement(Replacement).
  • Regular expressions in RegexContentFilter, RegexMetadataFilter, ReplaceTagger, TextBetweenTagger, ReplaceTransformer, StripAfterTransformer, StripBeforeTransformer, and StripBetweenTransformer now always have the Pattern.DOTALL flag enabled, and when case sensitivity is enabled for regex, Pattern.UNICODE_CASE is now always used.
  • Library updates: Apache Tika 1.8, Norconex Commons Lang 1.6.1, Apache Commons CLI 1.3, Apache Jempbox 1.8.9, Jempbox 2.0.0.
  • Removed these library "direct" dependencies: Carrot2 (3.9.4), Lucene Analyzers (5.0.0), and Stax2 API (3.1.4).
  • Javadoc fixes and updates.
  • New unit tests to cover all filter onMatch use cases.
  • Fixed filters not working properly when using onMatch="include". Affects all subclasses of AbstractDocumentFilter, which now details the include/exclude logic in its Javadoc (GitHub collector-http#108).
  • Fixed "Too many open files" exception.
  • Fixed the "restrictTo" feature not always working for AbstractImporterHandler subclasses.
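
  The "maxReadSize" and "sectionIndex" notes above describe processing large content one section at a time. A minimal sketch of that idea follows, under stated assumptions: SectionedReadSketch and SectionHandler are hypothetical names, sections are cut at a fixed character count (the real classes may choose section boundaries differently), and empty content produces no call at all.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not part of Norconex Importer.
final class SectionedReadSketch {

    /** Callback receiving one section of text and its zero-based index. */
    interface SectionHandler {
        void handle(String section, int sectionIndex);
    }

    /**
     * Reads at most maxReadSize characters at a time and hands each section
     * to the handler, so memory use is bounded by maxReadSize.
     * Returns the number of sections processed.
     */
    static int process(Reader reader, int maxReadSize, SectionHandler handler) {
        if (maxReadSize <= 0) {
            throw new IllegalArgumentException("maxReadSize must be positive");
        }
        char[] buffer = new char[maxReadSize];
        int sectionIndex = 0;
        try {
            int filled;
            while ((filled = fill(reader, buffer)) > 0) {
                handler.handle(new String(buffer, 0, filled), sectionIndex++);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sectionIndex;
    }

    /** Fills the buffer as far as possible; returns the characters read. */
    private static int fill(Reader reader, char[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int read = reader.read(buffer, total, buffer.length - total);
            if (read < 0) {
                break; // end of stream
            }
            total += read;
        }
        return total;
    }
}
```

  Small documents fit in a single section (index 0); only content longer than maxReadSize yields additional sections.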

New in Norconex Importer 2.1.1 (Apr 9, 2015)

  • PDFBox now uses latest snapshot (as opposed to a frozen one).
  • Javadoc fixes.
  • Library updates: SLF4J 1.7.12.

New in Norconex Importer 2.1.0 (Apr 1, 2015)

  • Added OCR support using the open-source Tesseract product. Configured by setting an OCRConfig on GenericDocumentParserFactory.
  • Added document translation support with the new TranslatorSplitter. Supports these translation APIs: Microsoft, Google, Lingo24, and Moses. The document content and/or chosen fields can be translated.
  • New TitleGeneratorTagger to dynamically generate titles out of documents, using Carrot2 to extract the best terms.
  • New EnhancedPDFParser and EnhancedPDF2XHTML classes modifying the original Tika PDFParser to add support for PDF XFA (dynamic forms) text extraction as well as adding support for PDFBox 2.0.0 (which fixes the stripping of space characters between words in many PDFs).
  • New XFDLParser for parsing PureEdge Extensible Forms Description Language files (XFDL). Supports both Gzipped+Base64 and plain text versions.
  • New WordPerfectParser class for parsing WordPerfect documents according to WordPerfect file specifications.
  • New QuattroProParser class for parsing QuattroPro documents according to QuattroPro file specifications.
  • New configuration "parseErrorsSaveDir" on importer configuration for saving files that caused parsing errors along with their exception and metadata if any.
  • KeepOnlyTagger and DeleteTagger now support regular expressions for identifying fields to keep/delete. The field="" attribute has been replaced by a element.
  • Added support for JBIG2 and jpeg2000 image formats.
  • Improved content detection of MS Office and Corel Office documents when importing an input stream with no specified extension.
  • Improved overall content detection accuracy and performance.
  • Default allocated memory for caching of document content was increased by a factor of 10 (10MB max per document, 100MB max total).
  • AbstractTikaParser can now be extended to modify Tika ParseContext.
  • importer.bat and importer.sh will now load the log4j.properties from the ./classes folder.
  • Now always flushes the output stream from parsers so implementors do not have to be concerned with this.
  • Easier to extend GenericDocumentParserFactory to provide custom parsers. Dropped "registerNamedParser", "registerFallbackParser", and "getFallbackParser" methods in favor of new "createFallbackParser" and "createNamedParsers" methods.
  • HTMLParser and PDFParser are now deprecated. HTML and PDF are now handled by the fall-back parser (auto-detected).
  • IDocumentSplittableEmbeddedParser is now deprecated and has no effect. Will be deleted in a future release.
  • Minor Javadoc improvements and fixes.
  • No longer adds null handlers (possible when configuration loading failed for a handler).
  • Improved exception handling for configuration loading.
  • Library updates: Tika 1.7, Norconex Commons Lang 1.6.0, JUnit 4.12, PDFBox 2.0.0 (SNAPSHOT-2015-03-28), Apache Commons Codec 1.10, Lucene Analyzer Common 5.0.0.
  • Updated several Maven plugins and added SonarQube Maven plugin.
  • Added Sonatype repository to pom.xml for snapshot releases.
  • Added more unit tests for various content type parsing.
  • Fixed embedded objects not always having the right content-type.
  • Fixed invalid mapping between "application/wordperfect" content type and WordPerfectParser.
  • Fixed AbstractCharStreamTagger subclasses badly detecting character encoding and failing documents as a consequence.
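
  To illustrate the regular-expression field selection mentioned above for KeepOnlyTagger and DeleteTagger, here is a small standalone sketch. It is not the taggers' code: KeepOnlyFieldsSketch is a hypothetical class, metadata is modeled as a plain Map, and requiring the expression to match the whole field name is an assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Norconex Importer.
final class KeepOnlyFieldsSketch {

    /** Removes every field whose name does not fully match the regex. */
    static void keepOnly(Map<String, List<String>> fields, String fieldsRegex) {
        Pattern pattern = Pattern.compile(fieldsRegex);
        fields.keySet().removeIf(name -> !pattern.matcher(name).matches());
    }

    /** Removes every field whose name fully matches the regex. */
    static void delete(Map<String, List<String>> fields, String fieldsRegex) {
        Pattern pattern = Pattern.compile(fieldsRegex);
        fields.keySet().removeIf(name -> pattern.matcher(name).matches());
    }
}
```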

New in Norconex Importer 2.0.0 (Nov 27, 2014)

  • Importing now returns an ImporterResponse, which may hold the imported document, along with nested documents, and an ImporterStatus.
  • New IDocumentSplitter handler and related classes, allowing implementations to split documents into more documents.
  • DefaultDocumentParserFactory can now be configured to treat embedded documents as distinct documents (committed separately). Parsers can now implement IDocumentSplittableEmbeddedParser to indicate they are supporting document splitting.
  • DefaultDocumentParserFactory can now ignore parsing specified content-types.
  • New IImporterResponseProcessor to process the import response.
  • Document encoding can now be explicitly specified when importing and the value gets stored as a metadata field.
  • New ContentTypeDetector for detecting the content-type from documents.
  • New ImporterDocument, holding all objects related to a document being imported.
  • New ImporterMetadata, extending Properties to provide additional import-related convenience methods and constants.
  • New CsvSplitter class for splitting comma-separated value files into multiple records/documents to be indexed.
  • New RegexContentFilter for accepting/rejecting documents based on a successful regular expression match on their content.
  • New CharacterCaseTagger for modifying the character case of a metadata field value.
  • New DateFormatTagger for parsing/formatting date from specified metadata fields.
  • New DebugTagger for logging document content and/or metadata to help with implementation and troubleshooting.
  • New LanguageTagger which analyzes a document content to automatically detect and store as metadata the document language.
  • New TextStatisticsTagger that stores as metadata statistical information about a document content (word count, average words per sentence, etc.).
  • New AbstractDocument* class for each type of handler, facilitating handler implementation.
  • Directory where temporary files are created is now configurable.
  • Added support for parsing .iso files.
  • Now licensed under The Apache License, Version 2.0.
  • Document content reads and writes are now performed in memory up to a configurable maximum size, after which the filesystem gets used. This reduces I/O and improves performance.
  • All handlers except filters can now be restricted to matching metadata values (configurable).
  • *.tagger, *.filter, and *.transformer handlers were moved to *.handler.tagger, *.handler.filter, and *.handler.transformer.
  • com.norconex.importer.ContentType has been replaced with com.norconex.commons.lang.file.ContentType.
  • For consistency, several references to metadata field names were renamed to use the term "field" (instead of "property" or other terms).
  • DefaultDocumentParserFactory was renamed to GenericDocumentParserFactory.
  • Handler "contentTypeRegex" tag was removed from handlers that supported it in favor of the more flexible "restrictTo" tag(s).
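
  The in-memory caching of document content noted above (memory up to a maximum size, then the filesystem) can be sketched as follows. This is only an illustration of the general technique: SpillBufferSketch and its methods are hypothetical, the threshold handling is simplified, and it says nothing about how the Importer's own content streams are written.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; not part of Norconex Importer.
final class SpillBufferSketch implements AutoCloseable {

    private final int maxMemoryBytes;
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private Path file;            // stays null while content is held in memory
    private OutputStream fileOut;

    SpillBufferSketch(int maxMemoryBytes) {
        this.maxMemoryBytes = maxMemoryBytes;
    }

    /** Writes to memory until the limit would be exceeded, then to a temp file. */
    void write(byte[] data) {
        try {
            if (file == null && memory.size() + data.length > maxMemoryBytes) {
                // Limit reached: move what is buffered so far to a temp file.
                file = Files.createTempFile("spill-sketch-", ".tmp");
                fileOut = Files.newOutputStream(file);
                memory.writeTo(fileOut);
                memory = null;
            }
            if (file == null) {
                memory.write(data, 0, data.length);
            } else {
                fileOut.write(data);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean isInMemory() {
        return file == null;
    }

    /** Returns everything written so far, from memory or from the temp file. */
    byte[] toByteArray() {
        try {
            if (file == null) {
                return memory.toByteArray();
            }
            fileOut.flush();
            return Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Closes the file stream and deletes the temp file, if any. */
    @Override
    public void close() {
        try {
            if (fileOut != null) {
                fileOut.close();
            }
            if (file != null) {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

  Small documents never touch the disk in this scheme, which is where the reduction in I/O comes from.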

New in Norconex Importer 1.3.0 (Aug 19, 2014)

  • Now stores the content "family" for each document as "importer.contentFamily". This is a higher level representation of a file's content type.
  • New SplitTagger: splits values into multiple values using a separator of choice.
  • New CopyTagger: copies document metadata fields to other fields.
  • New HierarchyTagger: splits a field string into multiple segments representing each node of a hierarchical branch.
  • Improved detection of certain mime types, such as those previously appearing as application/x-tika-*.
  • ReplaceTagger now supports regular expressions (via a new "regex" flag).
  • Can now detect these MS Visio mime-types properly: vsdx, vstx, vssx, vsdm, vstm, vssm.
  • AbstractCharStreamTransformer now enforces streaming as UTF-8.
  • Now requires Java 7 or higher.
  • ReplaceTagger regular matching now only replaces the matching "fromValue".
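
  The HierarchyTagger entry above describes splitting a field value into segments, one per node of a hierarchical branch. One plausible reading of that description is sketched below; HierarchySketch is a hypothetical class, and the exact output format of the real tagger (separators, ordering) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Norconex Importer.
final class HierarchySketch {

    /**
     * Expands a path-like value into one entry per hierarchy level, e.g.
     * "/a/b/c" with separator "/" gives "/a", "/a/b", and "/a/b/c".
     */
    static List<String> expand(String value, String separator) {
        List<String> nodes = new ArrayList<>();
        boolean leading = value.startsWith(separator);
        StringBuilder path = new StringBuilder();
        for (String segment : value.split(Pattern.quote(separator))) {
            if (segment.isEmpty()) {
                continue; // skip empty segments from leading/double separators
            }
            if (path.length() > 0 || leading) {
                path.append(separator);
            }
            path.append(segment);
            nodes.add(path.toString());
        }
        return nodes;
    }
}
```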

New in Norconex Importer 1.2.0 (Mar 14, 2014)

  • Now extracts text from WordPerfect documents (new WordPerfectParser class).
  • New transformer "ReduceConsecutivesTransformer" to reduce consecutive instances of the same string to only one instance.
  • New transformer "ReplaceTransformer" to perform search and replace on document content using regular expression.
  • New filter "EmptyMetadataFilter" to exclude/include documents with no data for one or more specified metadata properties.
  • Library updates: Tika 1.5, Norconex Commons Lang 1.3.0.
  • Now attempts to detect the character encoding from a character stream by looking at a Content-Type metadata. If none is present, defaults to UTF-8.
  • Fixed NPE in AbstractTextRestrictiveHandler when no content-type is found when used before parsing.
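
  The character-encoding fallback described above (use the charset found in the Content-Type metadata, otherwise default to UTF-8) could look roughly like the sketch below. CharsetFromContentTypeSketch is a hypothetical name and the parsing is deliberately simple; it is an assumption about the general approach, not the Importer's detection code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Norconex Importer.
final class CharsetFromContentTypeSketch {

    // Matches the charset parameter, e.g. in "text/html; charset=ISO-8859-1".
    private static final Pattern CHARSET_PARAM = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in the Content-Type value, else UTF-8. */
    static Charset detect(String contentType) {
        if (contentType != null) {
            Matcher matcher = CHARSET_PARAM.matcher(contentType);
            if (matcher.find()) {
                try {
                    return Charset.forName(matcher.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the default.
                }
            }
        }
        return StandardCharsets.UTF_8;
    }
}
```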

New in Norconex Importer 1.0.1 (Aug 5, 2013)

  • Upgraded Apache Tika from 1.3 to 1.4.
  • Removed dependency on aspectjrt due to GPL licensing incompatibility. If you need .iso parsing, you can manually download it and add it to the classpath.