New in NLTK 2.0.1 rc1 (Apr 11, 2011)
- added interface to the Stanford POS Tagger
- updates to sem.Boxer, sem.drt.DRS
- allow unicode strings in grammars
- allow non-string features in classifiers
- modifications to HunposTagger
- fixed issues with DRS printing
- fixed bigram collocation finder for window_size > 2
- doctest paths no longer presume unix-style pathname separators
- fixed issue with NLTK's tokenize module colliding with the Python tokenize module
- fixed issue with stemming Unicode strings
- changed ViterbiParser.nbest_parse to parse
- ChaSen and KNBC Japanese corpus readers
- preserve case in concordance display
- fixed bug in simplification of Brown tags
- a version of IBM Model 1 as described in Koehn 2010
- new class AlignedSent for aligned sentence data and evaluation metrics
- new nltk.util.set_proxy to allow easy configuration of HTTP proxy
- improvements to downloader user interface to catch URL and HTTP errors
- added CHILDES corpus reader
- created special exception hierarchy for Prover9 errors
- significant changes to the underlying code of the boxer interface
- path-based wordnet similarity metrics use a fake root node for verbs, following the Perl version
- added ability to handle multi-sentence discourses in Boxer
- added the 'english' Snowball stemmer
- simplifications and corrections of Earley Chart Parser rules
- several changes to the feature chart parsers for correct unification
- bugfixes: FreqDist.plot, FreqDist.max, NgramModel.entropy, CategorizedCorpusReader, DecisionTreeClassifier
- removal of Python >2.4 language features for 2.4 compatibility
- removal of deprecated functions and associated warnings
- added semantic domains to wordnet corpus reader
- changed wordnet similarity functions to include instance hyponyms
- updated to use latest version of Boxer
- Data:
- JEITA Public Morphologically Tagged Corpus (in ChaSen format)
- KNB Annotated corpus of Japanese blog posts
- Fixed some minor bugs in alvey.fcfg, and added number of parse trees in alvey_sentences.txt
- added more comtrans data
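To illustrate what the window_size entry above refers to, here is a minimal standard-library sketch of counting bigrams inside a sliding window. It is not NLTK's implementation: the name windowed_bigram_counts is hypothetical, and pairing each word with the words that follow it inside the window is an assumption about the intended behaviour.

```python
from collections import Counter


def windowed_bigram_counts(words, window_size=2):
    """Count ordered word pairs that occur within window_size tokens.

    Hypothetical helper for illustration; not part of NLTK.
    """
    if window_size < 2:
        raise ValueError("window_size must be at least 2")
    counts = Counter()
    for i, first in enumerate(words):
        # Pair the current word with each later word inside the window.
        for second in words[i + 1:i + window_size]:
            counts[(first, second)] += 1
    return counts


words = "the quick brown fox".split()
contiguous = windowed_bigram_counts(words, window_size=2)
discontiguous = windowed_bigram_counts(words, window_size=3)
```

With window_size=2 only adjacent pairs are counted; a larger window also admits pairs with intervening words, such as ("the", "brown").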
New in NLTK 2.0 Beta 9 (Jul 26, 2010)
- Many code and documentation cleanups
- Added port of Snowball stemmers
- Fixed loading of pickled tokenizers (issue 556)
- DecisionTreeClassifier now handles unknown features (issue 570)
- Added error messages to LogicParser
- Replaced max_models with end_size to prevent Mace from hanging
- Added interface to Boxer
- Added nltk.corpus.semcor to give access to SemCor 3.0 corpus (issue 530)
- Added support for integer- and float-valued features in maxent
- Permit NgramModels to be pickled
- Added Sourced Strings (see test/sourcedstring.doctest for details)
- Fixed bugs in Good-Turing and Simple Good-Turing Estimation (issue 26)
- Add support for span tokenization, aka standoff annotation of segmentation (incl Punkt)
- allow unicode nodes in Tree.productions()
- Fixed WordNet's morphy to be consistent with the original implementation, taking the shortest returned form instead of an arbitrary one (issues 427, 487)
- Fixed bug in MaxentClassifier
- Accepted bugfixes for YCOE corpus reader (issue 435)
- Added test to _cumulative_frequencies() to correctly handle the case when no arguments are supplied
- Added a TaggerI interface to the HunPos open-source tagger
- Return 0, not None, when no count is present for a lemma in WordNet
- fixed pretty-printing of unicode leaves
- More efficient calculation of the leftcorner relation for left corner parsers
- Added two functions for graph calculations: transitive closure and inversion.
- FreqDist.pop() and FreqDist.popitems() now invalid
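The span tokenization entry above describes standoff annotation: a tokenizer reports character offsets into the original text instead of copied substrings. The sketch below shows the idea with the standard library only; whitespace_spans is a hypothetical name, and splitting on whitespace is an assumption made to keep the example short.

```python
import re


def whitespace_spans(text):
    """Yield (start, end) character offsets of whitespace-delimited tokens.

    Hypothetical helper for illustration; not NLTK's tokenizer code.
    """
    for match in re.finditer(r"\S+", text):
        yield match.span()


text = "Good muffins cost $3.88"
spans = list(whitespace_spans(text))
# The tokens can always be recovered by slicing the untouched source text.
tokens = [text[start:end] for start, end in spans]
```

Because the offsets refer back to the source string, annotations can be layered over the text without modifying it.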
New in NLTK 2.0 Beta 8 (Mar 11, 2010)
- NLTK:
- fixed copyright and license statements
- removed PyYAML, and added dependency to installers and download instructions
- updates to LogicParser, DRT (Dan Garrette)
- WordNet similarity metrics return None instead of -1 when they fail to find a path (Steve Bethard)
- shortest_path_distance uses instance hypernyms (Jordan Boyd-Graber)
- clean_html improved (Bjorn Maeland)
- batch_parse, batch_interpret and batch_evaluate functions allow grammar or grammar filename as argument
- more Portuguese examples (portuguese_en.doctest, examples/pt.py)
- NLTK-Contrib:
- Aligner implementations (Christopher Crowner, Torsten Marek)
- ScriptTranscriber package (Richard Sproat and Kristy Hollingshead)
- Book:
- updates for second printing, correcting errata: http://nltk.googlecode.com/svn/trunk/nltk/doc/book/errata.txt
- Data:
- added Europarl sample, with 10 docs for each of 11 langs (Nitin Madnani)
- added SMULTRON sample corpus (Torsten Marek, Martin Volk)
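The change to return None instead of -1 when no path exists can be pictured with a toy path-based similarity. This is a sketch over a made-up hypernym graph, not NLTK's WordNet code: HYPERNYMS, ancestor_distances and path_similarity are hypothetical names, and the 1 / (distance + 1) formula is assumed here for illustration.

```python
from collections import deque

# Hypothetical toy hypernym graph: each node maps to its direct parents.
HYPERNYMS = {
    "dog": ["canine"],
    "canine": ["animal"],
    "cat": ["feline"],
    "feline": ["animal"],
    "animal": [],
    "run": [],
}


def ancestor_distances(node):
    """Map node and each of its ancestors to the number of links from node."""
    distances = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in HYPERNYMS.get(current, []):
            if parent not in distances:
                distances[parent] = distances[current] + 1
                queue.append(parent)
    return distances


def path_similarity(first, second):
    """Return 1 / (shortest path length + 1), or None if no path exists."""
    first_dist = ancestor_distances(first)
    second_dist = ancestor_distances(second)
    shared = set(first_dist) & set(second_dist)
    if not shared:
        # No common ancestor: signal failure with None rather than -1.
        return None
    distance = min(first_dist[n] + second_dist[n] for n in shared)
    return 1.0 / (distance + 1)
```

Returning None keeps a failed lookup from being mistaken for a numeric score, which a sentinel such as -1 would allow in arithmetic or comparisons.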
New in NLTK 2.0 Beta 6 (Sep 25, 2009)
- NLTK:
- minor fixes for Python 2.4 compatibility
- added words() method to XML corpus reader
- minor bugfixes and code clean-ups
- fixed downloader to put data in %APPDATA% on Windows
- Data:
- Updated Punkt models
- Fixed utf8 encoding issues with UDHR and Stopwords Corpora
- Renamed CoNLL "cat" files to "esp" (different language)
- Added Alvey NLT feature-based grammar
- Added Polish PL196x corpus
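The downloader fix above concerns where data is stored on Windows. A minimal sketch of that kind of platform check follows; default_data_dir is a hypothetical helper, the platform and environment are passed in as arguments purely to make the example testable, and the "nltk_data" directory name and HOME fallback are assumptions rather than a description of the downloader's code.

```python
import os


def default_data_dir(platform, environ):
    """Pick a per-user data directory (hypothetical helper, not NLTK's code)."""
    if platform == "win32" and "APPDATA" in environ:
        base = environ["APPDATA"]
    else:
        # Fall back to the user's home directory on other platforms.
        base = environ.get("HOME", os.path.expanduser("~"))
    return os.path.join(base, "nltk_data")


windows_dir = default_data_dir("win32", {"APPDATA": "C:\\Users\\me\\AppData\\Roaming"})
unix_dir = default_data_dir("linux", {"HOME": "/home/me"})
```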
New in NLTK 2.0 Beta 5 (Jul 20, 2009)
- NLTK:
- minor bugfixes (incl FreqDist, Python eggs)
- added reader for Europarl Corpora (contributed by Nitin Madnani)
- added reader for IPI PAN Polish Corpus (contributed by Konrad Goluchowski)
- fixed data.py so that it doesn't generate a warning for Windows Python 2.6
- NLTK-Contrib:
- updated Praat reader (contributed by Margaret Mitchell)
New in NLTK 0.9.9 (May 30, 2009)
- NLTK:
- Finalized API for NLTK 2.0 and the book, incl dozens of small fixes
- Names of the form nltk.foo.Bar now available as nltk.Bar for significant functionality; in some cases the name was modified (using old names will produce a deprecation warning)
- Bugfixes in downloader, WordNet
- Expanded functionality in DecisionTree
- Bigram collocations extended for discontiguous bigrams
- Translation toy nltk.misc.babelfish
- New module nltk.help giving access to tagset documentation
- Fix imports so that NLTK builds without Tkinter (Bjorn Maeland)
- Data:
- new maxent NE chunker model
- updated grammar packages for the book
- data for new tagsets collection, documenting several tagsets
- added lolcat translation to the Genesis collection
- Contrib (work in progress):
- Updates to coreference package (Joseph Frazee)
- New ISRI Arabic stemmer (Hosam Algasaier)
- Updates to Toolbox package (Greg Aumann)
- Book:
- Substantial editorial corrections ahead of final submission
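The renaming entry above says that old names still work but produce a deprecation warning. One common way to get that behaviour in Python is a forwarding wrapper, sketched below. The names tokenize_words, deprecated_alias and old_tokenize are hypothetical, and this is not claimed to be how NLTK implements its deprecations.

```python
import warnings


def tokenize_words(text):
    """Stand-in for functionality exposed under a new name (hypothetical)."""
    return text.split()


def deprecated_alias(new_func, old_name):
    """Return a wrapper that warns and then forwards to new_func."""
    def wrapper(*args, **kwargs):
        warnings.warn(
            "%s is deprecated; use %s instead" % (old_name, new_func.__name__),
            DeprecationWarning,
            stacklevel=2,
        )
        return new_func(*args, **kwargs)
    wrapper.__name__ = old_name
    return wrapper


# The old name keeps working but emits a DeprecationWarning when called.
old_tokenize = deprecated_alias(tokenize_words, "old_tokenize")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    tokens = old_tokenize("a b c")
```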
New in NLTK 0.9.9 Beta 1 (Mar 16, 2009)
- NLTK:
- Finalized API for NLTK 2.0 and the book
- Names of the form nltk.foo.Bar now available as nltk.Bar for significant functionality; in some cases the name was modified (using old names will produce a deprecation warning)
- Bugfixes in downloader, WordNet
- Expanded functionality in DecisionTree
- Translation toy nltk.misc.babelfish
- New module nltk.help giving access to tagset documentation
- Fix imports so that NLTK builds without Tkinter (Bjorn Maeland)
- Data:
- new maxent NE chunker model
- updated grammar packages for the book
- data for new tagsets collection, documenting several tagsets
- added lolcat translation to the Genesis collection
- Contrib (work in progress):
- Updates to coreference package (Joseph Frazee)
- New ISRI Arabic stemmer (Hosam Algasaier)
- Updates to Toolbox package (Greg Aumann)
- Book:
- Substantial editorial corrections ahead of final submission
New in NLTK 0.9.8 (Feb 18, 2009)
- NLTK:
- New off-the-shelf tokenizer, POS tagger, and named-entity tagger
- New metrics package with inter-annotator agreement scores, distance metrics, rank correlation
- New collocations package (Joel Nothman)
- Many clean-ups to WordNet package (Steven Bethard, Jordan Boyd-Graber)
- Moved old pywordnet-based WordNet package to nltk_contrib
- WordNet browser (Paul Bone)
- New interface to dependency treebank corpora
- Moved MinimalSet class into nltk.misc package
- Put NLTK applications in new nltk.app package
- Many other improvements incl semantics package, toolbox, MaltParser
- Misc changes to many API names in preparation for 1.0, old names deprecated
- Most classes now available in the top-level namespace
- Work on Python egg distribution (Brandon Rhodes)
- Removed deprecated code remaining from 0.8.x versions
- Fixes for Python 2.4 compatibility
- Data:
- Corrected identifiers in Dependency Treebank corpus
- Basque and Catalan Dependency Treebanks (CoNLL 2007)
- PE08 Parser Evaluation data
- New models for POS tagger and named-entity tagger
- Book:
- Substantial editorial corrections
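The metrics package entry above mentions distance metrics. As one example of that family, here is a plain Levenshtein edit distance written with the standard library. It is a sketch of the textbook algorithm, not necessarily what the package implements, and the function name is chosen here for illustration.

```python
def edit_distance(source, target):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]
```

Keeping only two rows of the table at a time is enough because each cell depends only on the current and previous rows.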
New in NLTK 0.9.7 (Dec 19, 2008)
- NLTK:
- fixed problems with accessing zipped corpora
- improved design and efficiency of grammars and chart parsers including new bottom-up combine strategy and a redesigned Earley strategy (Peter Ljunglof)
- fixed bugs in smoothed probability distributions and added regression tests (Peter Ljunglof)
- improvements to Punkt (Joel Nothman)
- improvements to text classifiers
- simple word-overlap RTE classifier
- Data:
- A new package of large grammars (Peter Ljunglof)
- A small gazetteer corpus and corpus reader
- Organized example grammars into separate packages
- Children's stories added to gutenberg package
- Contrib (work in progress):
- fixes and demonstration for named-entity feature extractors in nltk_contrib.coref
- Book:
- extensive changes throughout, including new chapter 5 on classification and substantially revised chapter 11 on managing linguistic data
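The word-overlap RTE classifier listed above rests on a simple idea: a hypothesis is more likely to be entailed when most of its words also occur in the text. The sketch below shows one way to express that; the function names, the feature names and the 0.75 threshold are all assumptions for illustration and are not taken from NLTK.

```python
def word_overlap_features(text, hypothesis):
    """Compute simple overlap features between a text and a hypothesis."""
    text_words = set(text.lower().split())
    hyp_words = set(hypothesis.lower().split())
    shared = text_words & hyp_words
    return {
        "overlap": len(shared),                    # words in both
        "hyp_extra": len(hyp_words - text_words),  # hypothesis-only words
        "coverage": len(shared) / len(hyp_words) if hyp_words else 0.0,
    }


def predict_entailment(text, hypothesis, threshold=0.75):
    """Guess entailment when enough hypothesis words appear in the text."""
    return word_overlap_features(text, hypothesis)["coverage"] >= threshold


features = word_overlap_features("the cat sat on the mat", "the cat sat")
```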
New in NLTK 0.9.6 (Dec 9, 2008)
- NLTK:
- new WordNet corpus reader (contributed by Steven Bethard)
- incorporated dependency parsers into NLTK (was NLTK-Contrib) (contributed by Jason Narad)
- moved nltk/cfg.py to nltk/grammar.py and incorporated dependency grammars
- improved efficiency of unification algorithm
- various enhancements to the semantics package
- added plot() and tabulate() methods to FreqDist and ConditionalFreqDist
- FreqDist.keys() and list(FreqDist) provide keys reverse-sorted by value, to avoid the confusion caused by FreqDist.sorted()
- new downloader module to support interactive data download: call nltk.download(), or run "python -m nltk.downloader all" from the command line
- fixed WordNet bug that caused min_depth() to sometimes give incorrect result
- added nltk.util.Index as a wrapper around defaultdict(list) plus a functional-style initializer
- fixed bug in Earley chart parser that caused it to break
- added basic TnT tagger nltk.tag.tnt
- new corpus reader for CoNLL dependency format (contributed by Kepa Sarasola and Iker Manterola)
- misc other bugfixes
- Contrib (work in progress):
- TIGERSearch implementation by Torsten Marek
- extensions to hole and glue semantics modules by Dan Garrette
- new coreference package by Joseph Frazee
- MapReduce interface by Xinfan Meng
- Data:
- Corpora are stored in compressed format if this will not compromise speed of access
- Swadesh Corpus of comparative wordlists in 23 languages
- Split grammar collection into separate packages
- New Basque and Spanish grammar samples (contributed by Kepa Sarasola and Iker Manterola)
- Brown Corpus sections now have meaningful names (e.g. 'a' is now 'news')
- Fixed bug that forced users to manually unzip the WordNet corpus
- New dependency-parsed version of Treebank corpus sample
- Added movie script "Monty Python and the Holy Grail" to webtext corpus
- Replaced words corpus data with a much larger list of English words
- New URL for list of available NLTK corpora http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
- Book:
- complete rewrite of first three chapters to make the book accessible to a wider audience
- new chapter on data-intensive language processing
- extensive reworking of most chapters
- Dropped subsection numbering; moved exercises to end of chapters
- Distributions:
- created Portfile to support Mac installation
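The nltk.util.Index entry above describes a wrapper around defaultdict(list) with a functional-style initializer. A minimal sketch of that pattern, written against the standard library only, might look like the following; it mirrors the description in the entry and is not copied from NLTK's source.

```python
from collections import defaultdict


class Index(defaultdict):
    """A defaultdict(list) that can be built directly from (key, value) pairs."""

    def __init__(self, pairs=()):
        super().__init__(list)
        for key, value in pairs:
            self[key].append(value)


words = ["apple", "avocado", "banana", "blueberry", "cherry"]
# Group words by their first letter in a single expression.
by_initial = Index((word[0], word) for word in words)
```

Passing the pairs to the constructor lets an index be built from a generator expression without a separate loop at the call site.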