NLTK Sentence Tokenizer Incorrect

I've noticed that NLTK's sent_tokenize makes mistakes with some dates. Is there any way to adjust it so that it correctly tokenizes the following:

valid any day after january 1. not valid on federal holidays, including february 14,
or with other in-house events, specials, or happy hour.

Currently running sent_tokenize results in:

['valid any day after january 1. not valid on federal holidays, including february 14, 
 or with other in-house events, specials, or happy hour.']

But it should result in:

['valid any day after january 1.', 'not valid on federal holidays, including february 14, 
  or with other in-house events, specials, or happy hour.']

as the period after 'january 1' is a legitimate sentence termination character.


ANSWERS:


Firstly, the sent_tokenize function uses the Punkt tokenizer, which was trained to tokenize well-formed English sentences. So including the correct capitalization would resolve your problem:

>>> from nltk import sent_tokenize
>>> s = 'valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
>>> sent_tokenize(s)
['valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']
>>>
>>> s2 = 'Valid any day after january 1. Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
>>> sent_tokenize(s2)
['Valid any day after january 1.', 'Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']

Now, let's dig deeper. The Punkt tokenizer implements the algorithm of Kiss and Strunk (2005); see the nltk.tokenize.punkt module for the implementation.

This tokenizer divides a text into a list of sentences, by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.
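
For example, here is a minimal sketch of training your own Punkt model on raw text. The variables train_text and some_document are placeholders for your own plaintext, not part of NLTK:

from nltk.tokenize.punkt import PunktSentenceTokenizer

# train_text: a large plaintext sample in the target language (placeholder)
tokenizer = PunktSentenceTokenizer(train_text)  # unsupervised training happens in the constructor
# some_document: the text you actually want to split (placeholder)
sentences = tokenizer.tokenize(some_document)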

So in the case of sent_tokenize, I'm quite sure it was trained on a well-formed English corpus, hence capitalization after a full stop is a strong indication of a sentence boundary, while a full stop on its own might not be, since we have things like i.e. and e.g.

And in some cases the corpus might contain things like '01. put pasta in pot \n02. fill the pot with water'. With such sentences/documents in the training data, it is very likely that the algorithm decides that a full stop followed by a non-capitalized word is not a sentence boundary.
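
As a rough sanity check, you can load the pre-trained English model that sent_tokenize uses and peek at what it has learned. Note that _params is an internal attribute and the resource name may differ between NLTK versions (newer releases ship 'punkt_tab'), so treat this only as an illustration:

import nltk

# the classic pre-trained English Punkt model used by sent_tokenize
tok = nltk.data.load('tokenizers/punkt/english.pickle')

# learned abbreviations (stored lowercased, without the final period)
print(sorted(tok._params.abbrev_types)[:20])
# learned sentence starters and collocations
print(len(tok._params.sent_starters), len(tok._params.collocations))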

So to resolve the problem, I suggest the following:

  1. Manually segment 10-20% of your sentences and retrain a corpus-specific tokenizer (a minimal sketch follows this list)
  2. Convert your corpus into well-formed orthography before using sent_tokenize
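
For option 1, here is a minimal sketch of retraining on your own corpus with PunktTrainer. It assumes corpus_text is a plaintext string built from the portion of your data you have checked by hand (a placeholder, not an NLTK name):

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True   # also learn frequent collocations, not just abbreviations
trainer.train(corpus_text, finalize=False)
trainer.finalize_training()

params = trainer.get_params()
# optionally seed domain-specific abbreviations; periods after these
# will not be treated as sentence boundaries
params.abbrev_types.update({'approx', 'dept'})

custom_tokenizer = PunktSentenceTokenizer(params)
print(custom_tokenizer.tokenize(
    'valid any day after january 1. not valid on federal holidays, '
    'including february 14, or with other in-house events, specials, or happy hour.'))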

See also: training data format for nltk punkt


