How to keep certain entities as one word using NLTK tokenize in Python?

How does one keep certain strings together in the following? For example,

import nltk
sentence = "?!a# .see"
tokens = nltk.word_tokenize(sentence)
tokens

gives

['?', '!', 'a', '#', '.see'] rather than keeping '?!a#' as one entity.


ANSWERS:


It looks like you want to split the string on whitespace, so calling split() is enough:

>>> sentence.split()
['?!a#', '.see']

However, if you really want to use a tokenizer, you can use NLTK's RegexpTokenizer:

>>> from nltk.tokenize import RegexpTokenizer
>>> word_tokenizer = RegexpTokenizer(r'\S+')
>>> word_tokenizer.tokenize(sentence)
['?!a#', '.see']

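'\S' matches any non-whitespace character, so the pattern matches each run of non-whitespace characters. That gives the same result as splitting on whitespace.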
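
If you only want specific strings kept whole while everything else is still broken into words and punctuation, one option is to put those strings first in the regular expression. The following is a minimal sketch using only Python's re module; the function name tokenize_keeping and its protected argument are made up for illustration, and the fallback rule (runs of word characters, otherwise single punctuation characters) is an assumption that does not reproduce what word_tokenize does.

```python
import re

def tokenize_keeping(text, protected):
    """Tokenize text, keeping each string in `protected` as a single token.

    Hypothetical helper for illustration; not part of NLTK.
    """
    # Try longer protected strings first so they win over their own prefixes.
    alternatives = [re.escape(p) for p in sorted(protected, key=len, reverse=True)]
    # Fallback: a run of word characters, or one punctuation character.
    alternatives += [r"\w+", r"[^\w\s]"]
    pattern = re.compile("|".join(alternatives))
    # No capturing groups, so findall returns the full matches in order.
    return pattern.findall(text)

tokens = tokenize_keeping("?!a# .see", ["?!a#"])
print(tokens)  # ['?!a#', '.', 'see']
```

Because the pattern has no capturing groups, the same pattern string should also work when passed to RegexpTokenizer, which applies it with findall in its default (non-gaps) mode.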


