I would like Python to store words, not characters, as the basic unit of a sentence.
    import nltk
    from nltk.tokenize import RegexpTokenizer

    sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    word_tokenizer = RegexpTokenizer(r'\w+')

    my_text = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about ''the pain of a broken trust'' that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement. ''Frankly, the onus is on law enforcement because we are the ones who have taken the oath to protect and to serve the people of this city,'' Ms. Lynch said in 2000."

    len(my_text)
    Out: 498

    my_sents = sent_tokenizer.tokenize(my_text)
    len(my_sents)
    Out: 2
However, if I ask for the length of the first sentence, it gives me its length in characters:

    len(my_sents[0])
    Out: 337
I can get the individual words (not structured into sentences) by tokenising the sentence list:

    my_words = word_tokenizer.tokenize(str(my_sents))
    len(my_words)
    Out: 86
But is it possible to store the words in a sentence structure? E.g.:

    print 'The first sentence has', len(my_sents[0]), 'words'
    The first sentence has 64 words
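The structure described above is essentially a list of lists: one inner list of words per sentence. A minimal, self-contained sketch of that idea (using plain `re` with the question's `\w+` pattern, and a crude regex sentence split standing in for NLTK's Punkt tokenizer; the short `my_text` here is a stand-in, not the original news excerpt):

```python
import re

# Stand-in text (shortened); the question uses a longer news excerpt.
my_text = 'One two three. Four five.'

# Crude split on terminal punctuation followed by whitespace,
# standing in for the Punkt sentence tokenizer.
sentences = re.split(r'(?<=[.!?])\s+', my_text.strip())

# Tokenize each sentence separately, giving a list of lists of words,
# with the same \w+ pattern as the question's RegexpTokenizer.
sents_as_words = [re.findall(r'\w+', s) for s in sentences]

print('The text has', len(sents_as_words), 'sentences')
print('The first sentence has', len(sents_as_words[0]), 'words')
```

With this layout, `len(sents_as_words)` counts sentences and `len(sents_as_words[0])` counts the words of the first sentence, which is the behaviour asked for above.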