NLTK sentence tokenisation with words as sentence units

I would like Python to treat words, not characters, as the basic unit of a sentence.

import nltk
from nltk.tokenize import RegexpTokenizer

sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
word_tokenizer = RegexpTokenizer(r'\w+')

my_text = 'WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about ''the pain of a broken trust'' that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement. Frankly, the onus is on law enforcement because we are the ones who have taken the oath to protect and to serve the people of this city,'' Ms. Lynch said in 2000.'

len(my_text)
Out[129]: 498

my_sents = sent_tokenizer.tokenize(my_text)

len(my_sents)
Out[132]: 2

However, if I ask for the length of the first sentence, I get its length in characters:

len(my_sents[0])
Out[133]: 337

I can get individual words (not structured into sentences) by tokenising the sentences:

my_words = word_tokenizer.tokenize(str(my_sents))
len(my_words)
Out[140]: 86

But is it possible to store the words in a sentence structure? For example, I would like the following to report a word count:

print 'The sentence has ', len(my_sents[0]), ' words'
The sentence has 64 words


ANSWERS:


import nltk
nltk.word_tokenize("Tokenize this!")

results in

['Tokenize', 'this', '!']

Is that what you're after?
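
If the goal is one list of words per sentence, a minimal sketch is to tokenise each sentence separately and keep the results in a list of lists. The sample sentences and the name words_by_sent are made up for illustration. re.findall(r'\w+', ...) stands in for word_tokenizer.tokenize(...) so the snippet runs without NLTK, and it should match the same words as RegexpTokenizer(r'\w+').

```python
import re

# Stand-in for the output of sent_tokenizer.tokenize(my_text):
# a plain list of sentence strings (hypothetical sample data).
my_sents = [
    "Tokenize this!",
    "Then count the words in each sentence.",
]

# Build one list of words per sentence, so the sentence structure is kept.
# With NLTK, word_tokenizer.tokenize(sent) could replace re.findall here.
words_by_sent = [re.findall(r'\w+', sent) for sent in my_sents]

# len() of an inner list now counts words rather than characters.
for words in words_by_sent:
    print('The sentence has {} words'.format(len(words)))
```

With this structure, len(words_by_sent) is the number of sentences and len(words_by_sent[0]) is the number of words in the first one. Swapping in nltk.word_tokenize would also count punctuation marks as tokens, as in the ['Tokenize', 'this', '!'] output above.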


