NLTK Changelog

New in NLTK 2.0.1 rc1 (Apr 11, 2011)

  • added interface to the Stanford POS Tagger
  • updates to sem.Boxer, sem.drt.DRS
  • allow unicode strings in grammars
  • allow non-string features in classifiers
  • modifications to HunposTagger
  • fixed issues with DRS printing
  • fixed bigram collocation finder for window_size > 2
  • doctest paths no longer presume unix-style pathname separators
  • fixed issue with NLTK's tokenize module colliding with the Python tokenize module
  • fixed issue with stemming Unicode strings
  • changed ViterbiParser.nbest_parse to parse
  • ChaSen and KNBC Japanese corpus readers
  • preserve case in concordance display
  • fixed bug in simplification of Brown tags
  • a version of IBM Model 1 as described in Koehn 2010
  • new class AlignedSent for aligned sentence data and evaluation metrics
  • new nltk.util.set_proxy to allow easy configuration of HTTP proxy
  • improvements to downloader user interface to catch URL and HTTP errors
  • added CHILDES corpus reader
  • created special exception hierarchy for Prover9 errors
  • significant changes to the underlying code of the boxer interface
  • path-based wordnet similarity metrics use a fake root node for verbs, following the Perl version
  • added ability to handle multi-sentence discourses in Boxer
  • added the 'english' Snowball stemmer
  • simplifications and corrections of Earley Chart Parser rules
  • several changes to the feature chart parsers for correct unification
  • bugfixes: FreqDist.plot, FreqDist.max, NgramModel.entropy, CategorizedCorpusReader, DecisionTreeClassifier
  • removal of Python >2.4 language features for 2.4 compatibility
  • removal of deprecated functions and associated warnings
  • added semantic domains to wordnet corpus reader
  • changed wordnet similarity functions to include instance hyponyms
  • updated to use latest version of Boxer
  • Data:
  • JEITA Public Morphologically Tagged Corpus (in ChaSen format)
  • KNB Annotated corpus of Japanese blog posts
  • Fixed some minor bugs in alvey.fcfg, and added number of parse trees in alvey_sentences.txt
  • added more comtrans data

New in NLTK 2.0 Beta 9 (Jul 26, 2010)

  • Many code and documentation cleanups
  • Added port of Snowball stemmers
  • Fixed loading of pickled tokenizers (issue 556)
  • DecisionTreeClassifier now handles unknown features (issue 570)
  • Added error messages to LogicParser
  • Replaced max_models with end_size to prevent Mace from hanging
  • Added interface to Boxer
  • Added nltk.corpus.semcor to give access to SemCor 3.0 corpus (issue 530)
  • Added support for integer- and float-valued features in maxent
  • Permit NgramModels to be pickled
  • Added Sourced Strings (see test/sourcedstring.doctest for details)
  • Fixed bugs in Good-Turing and Simple Good-Turing Estimation (issue 26)
  • Add support for span tokenization, aka standoff annotation of segmentation (incl Punkt)
  • Allow unicode nodes in Tree.productions()
  • Fixed WordNet's morphy to be consistent with the original implementation, taking the shortest returned form instead of an arbitrary one (issues 427, 487)
  • Fixed bug in MaxentClassifier
  • Accepted bugfixes for YCOE corpus reader (issue 435)
  • Added test to _cumulative_frequencies() to correctly handle the case when no arguments are supplied
  • Added a TaggerI interface to the HunPos open-source tagger
  • Return 0, not None, when no count is present for a lemma in WordNet
  • fixed pretty-printing of unicode leaves
  • More efficient calculation of the leftcorner relation for left corner parsers
  • Added two functions for graph calculations: transitive closure and inversion.
  • FreqDist.pop() and FreqDist.popitems() now invalid
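
The two graph calculations mentioned above (transitive closure and inversion) can be illustrated with a minimal sketch. This is not NLTK's implementation: the function names, and the choice to represent a graph as a dict mapping each node to a set of successor nodes, are assumptions made for illustration only.

```python
def transitive_closure(graph):
    """Map each node to the set of all nodes reachable from it.

    `graph` is assumed to be a dict of node -> set of successor nodes.
    """
    closure = {}
    for start in graph:
        reachable = set()
        stack = list(graph[start])
        while stack:
            node = stack.pop()
            if node not in reachable:
                reachable.add(node)
                # Nodes that appear only as targets have no successors.
                stack.extend(graph.get(node, ()))
        closure[start] = reachable
    return closure


def invert_graph(graph):
    """Return a new graph in which every edge is reversed."""
    inverted = {node: set() for node in graph}
    for node, targets in graph.items():
        for target in targets:
            inverted.setdefault(target, set()).add(node)
    return inverted


# Hypothetical toy graph: a -> b -> c
toy_graph = {"a": {"b"}, "b": {"c"}, "c": set()}
toy_closure = transitive_closure(toy_graph)
toy_inverse = invert_graph(toy_graph)
```

Here toy_closure["a"] contains both "b" and "c", and toy_inverse["c"] contains "b".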

New in NLTK 2.0 Beta 8 (Mar 11, 2010)

  • NLTK:
  • fixed copyright and license statements
  • removed PyYAML, and added dependency to installers and download instructions
  • updates to LogicParser, DRT (Dan Garrette)
  • WordNet similarity metrics return None instead of -1 when they fail to find a path (Steve Bethard)
  • shortest_path_distance uses instance hypernyms (Jordan Boyd-Graber)
  • clean_html improved (Bjorn Maeland)
  • batch_parse, batch_interpret and batch_evaluate functions allow grammar or grammar filename as argument
  • more Portuguese examples (portuguese_en.doctest, examples/pt.py)
  • NLTK-Contrib:
  • Aligner implementations (Christopher Crowner, Torsten Marek)
  • ScriptTranscriber package (Richard Sproat and Kristy Hollingshead)
  • Book:
  • updates for second printing, correcting errata: http://nltk.googlecode.com/svn/trunk/nltk/doc/book/errata.txt
  • Data:
  • added Europarl sample, with 10 docs for each of 11 langs (Nitin Madnani)
  • added SMULTRON sample corpus (Torsten Marek, Martin Volk)

New in NLTK 2.0 Beta 6 (Sep 25, 2009)

  • NLTK:
  • minor fixes for Python 2.4 compatibility
  • added words() method to XML corpus reader
  • minor bugfixes and code clean-ups
  • fixed downloader to put data in %APPDATA% on Windows
  • Data:
  • Updated Punkt models
  • Fixed utf8 encoding issues with UDHR and Stopwords Corpora
  • Renamed CoNLL "cat" files to "esp" (different language)
  • Added Alvey NLT feature-based grammar
  • Added Polish PL196x corpus

New in NLTK 2.0 Beta 5 (Jul 20, 2009)

  • NLTK:
  • minor bugfixes (incl FreqDist, Python eggs)
  • added reader for Europarl Corpora (contributed by Nitin Madnani)
  • added reader for IPI PAN Polish Corpus (contributed by Konrad Goluchowski)
  • fixed data.py so that it doesn't generate a warning for Windows Python 2.6
  • NLTK-Contrib:
  • updated Praat reader (contributed by Margaret Mitchell)

New in NLTK 0.9.9 (May 30, 2009)

  • NLTK:
  • Finalized API for NLTK 2.0 and the book, incl dozens of small fixes
  • Names of the form nltk.foo.Bar now available as nltk.Bar for significant functionality; in some cases the name was modified (using old names will produce a deprecation warning)
  • Bugfixes in downloader, WordNet
  • Expanded functionality in DecisionTree
  • Bigram collocations extended for discontiguous bigrams
  • Translation toy nltk.misc.babelfish
  • New module nltk.help giving access to tagset documentation
  • Fix imports so that NLTK builds without Tkinter (Bjorn Maeland)
  • Data:
  • new maxent NE chunker model
  • updated grammar packages for the book
  • data for new tagsets collection, documenting several tagsets
  • added lolcat translation to the Genesis collection
  • Contrib (work in progress):
  • Updates to coreference package (Joseph Frazee)
  • New ISRI Arabic stemmer (Hosam Algasaier)
  • Updates to Toolbox package (Greg Aumann)
  • Book:
  • Substantial editorial corrections ahead of final submission
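
The discontiguous bigrams mentioned in the NLTK items above pair each word with later words that fall inside a window, rather than only with its immediate neighbour. A minimal sketch of that idea follows; it is not the collocations package's code, and the function name windowed_bigrams and its window_size parameter are assumptions chosen for illustration.

```python
def windowed_bigrams(words, window_size=2):
    """Pair each word with every later word inside the window.

    window_size=2 yields ordinary adjacent bigrams; larger values
    also yield discontiguous pairs. `words` is assumed to be a list.
    """
    if window_size < 2:
        raise ValueError("window_size must be at least 2")
    pairs = []
    for i, first in enumerate(words):
        # The window covers the word itself plus window_size - 1 followers.
        for second in words[i + 1 : i + window_size]:
            pairs.append((first, second))
    return pairs


tokens = ["a", "b", "c", "d"]
adjacent = windowed_bigrams(tokens, window_size=2)
discontiguous = windowed_bigrams(tokens, window_size=3)
```

With window_size=3 the result includes pairs such as ("a", "c") that skip over one intervening word.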

New in NLTK 0.9.9 Beta 1 (Mar 16, 2009)

  • NLTK:
  • Finalized API for NLTK 2.0 and the book
  • Names of the form nltk.foo.Bar now available as nltk.Bar for significant functionality; in some cases the name was modified (using old names will produce a deprecation warning)
  • Bugfixes in downloader, WordNet
  • Expanded functionality in DecisionTree
  • Translation toy nltk.misc.babelfish
  • New module nltk.help giving access to tagset documentation
  • Fix imports so that NLTK builds without Tkinter (Bjorn Maeland)
  • Data:
  • new maxent NE chunker model
  • updated grammar packages for the book
  • data for new tagsets collection, documenting several tagsets
  • added lolcat translation to the Genesis collection
  • Contrib (work in progress):
  • Updates to coreference package (Joseph Frazee)
  • New ISRI Arabic stemmer (Hosam Algasaier)
  • Updates to Toolbox package (Greg Aumann)
  • Book:
  • Substantial editorial corrections ahead of final submission

New in NLTK 0.9.8 (Feb 18, 2009)

  • NLTK:
  • New off-the-shelf tokenizer, POS tagger, and named-entity tagger
  • New metrics package with inter-annotator agreement scores, distance metrics, rank correlation
  • New collocations package (Joel Nothman)
  • Many clean-ups to WordNet package (Steven Bethard, Jordan Boyd-Graber)
  • Moved old pywordnet-based WordNet package to nltk_contrib
  • WordNet browser (Paul Bone)
  • New interface to dependency treebank corpora
  • Moved MinimalSet class into nltk.misc package
  • Put NLTK applications in new nltk.app package
  • Many other improvements incl semantics package, toolbox, MaltParser
  • Misc changes to many API names in preparation for 1.0, old names deprecated
  • Most classes now available in the top-level namespace
  • Work on Python egg distribution (Brandon Rhodes)
  • Removed deprecated code remaining from 0.8.* versions
  • Fixes for Python 2.4 compatibility
  • Data:
  • Corrected identifiers in Dependency Treebank corpus
  • Basque and Catalan Dependency Treebanks (CoNLL 2007)
  • PE08 Parser Evaluation data
  • New models for POS tagger and named-entity tagger
  • Book:
  • Substantial editorial corrections

New in NLTK 0.9.7 (Dec 19, 2008)

  • NLTK:
  • fixed problems with accessing zipped corpora
  • improved design and efficiency of grammars and chart parsers including new bottom-up combine strategy and a redesigned Earley strategy (Peter Ljunglof)
  • fixed bugs in smoothed probability distributions and added regression tests (Peter Ljunglof)
  • improvements to Punkt (Joel Nothman)
  • improvements to text classifiers
  • simple word-overlap RTE classifier
  • Data:
  • A new package of large grammars (Peter Ljunglof)
  • A small gazetteer corpus and corpus reader
  • Organized example grammars into separate packages
  • Children's stories added to gutenberg package
  • Contrib (work in progress):
  • fixes and demonstration for named-entity feature extractors in nltk_contrib.coref
  • Book:
  • extensive changes throughout, including new chapter 5 on classification and substantially revised chapter 11 on managing linguistic data

New in NLTK 0.9.6 (Dec 9, 2008)

  • NLTK:
  • new WordNet corpus reader (contributed by Steven Bethard)
  • incorporated dependency parsers into NLTK (was NLTK-Contrib) (contributed by Jason Narad)
  • moved nltk/cfg.py to nltk/grammar.py and incorporated dependency grammars
  • improved efficiency of unification algorithm
  • various enhancements to the semantics package
  • added plot() and tabulate() methods to FreqDist and ConditionalFreqDist
  • FreqDist.keys() and list(FreqDist) provide keys reverse-sorted by value, to avoid the confusion caused by FreqDist.sorted()
  • new downloader module to support interactive data download via nltk.download(); it can also be run from the command line using "python -m nltk.downloader all"
  • fixed WordNet bug that caused min_depth() to sometimes give incorrect result
  • added nltk.util.Index as a wrapper around defaultdict(list) plus a functional-style initializer
  • fixed bug in Earley chart parser that caused it to break
  • added basic TnT tagger nltk.tag.tnt
  • new corpus reader for CoNLL dependency format (contributed by Kepa Sarasola and Iker Manterola)
  • misc other bugfixes
  • Contrib (work in progress):
  • TIGERSearch implementation by Torsten Marek
  • extensions to hole and glue semantics modules by Dan Garrette
  • new coreference package by Joseph Frazee
  • MapReduce interface by Xinfan Meng
  • Data:
  • Corpora are stored in compressed format if this will not compromise speed of access
  • Swadesh Corpus of comparative wordlists in 23 languages
  • Split grammar collection into separate packages
  • New Basque and Spanish grammar samples (contributed by Kepa Sarasola and Iker Manterola)
  • Brown Corpus sections now have meaningful names (e.g. 'a' is now 'news')
  • Fixed bug that forced users to manually unzip the WordNet corpus
  • New dependency-parsed version of Treebank corpus sample
  • Added movie script "Monty Python and the Holy Grail" to webtext corpus
  • Replaced words corpus data with a much larger list of English words
  • New URL for list of available NLTK corpora http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
  • Book:
  • complete rewrite of first three chapters to make the book accessible to a wider audience
  • new chapter on data-intensive language processing
  • extensive reworking of most chapters
  • Dropped subsection numbering; moved exercises to end of chapters
  • Distributions:
  • created Portfile to support Mac installation
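
The 0.9.6 notes above describe nltk.util.Index as a wrapper around defaultdict(list) plus a functional-style initializer. A minimal sketch of what such a class could look like is given below. It is an illustration written for this changelog, not NLTK's source, and it assumes the initializer takes an iterable of (key, value) pairs.

```python
from collections import defaultdict


class Index(defaultdict):
    """Sketch of a defaultdict(list) that is filled from (key, value) pairs."""

    def __init__(self, pairs=()):
        super().__init__(list)
        for key, value in pairs:
            self[key].append(value)


# Hypothetical usage: index words by their first letter.
words = ["apple", "avocado", "banana", "blueberry", "cherry"]
by_initial = Index((word[0], word) for word in words)
```

Building the index from a generator expression avoids the explicit loop that a plain defaultdict(list) would need, which is presumably the convenience the "functional-style initializer" refers to.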