Change elements of an NLTK word-tagged sentence and write it back. Can NLTK recreate a sentence?

Imagine this:

import nltk
from nltk import word_tokenize
text = word_tokenize("And now, for something completely different, I will read this Python Book!")
tagged = nltk.pos_tag(text)

I would like to be able to find any JJ elements and replace them with my own words, so that instead of:

[('And', 'CC'), ('now', 'RB'), (',', ','), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ'), (',', ','), ('I', 'PRP'), ('will', 'MD'), ('read', 'VB'), ('this', 'DT'), ('Python', 'NNP'), ('Book', 'NNP'), ('!', '.')]

I will have:

[('And', 'CC'), ('now', 'RB'), (',', ','), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('distinctive', 'JJ'), (',', ','), ('I', 'PRP'), ('will', 'MD'), ('read', 'VB'), ('this', 'DT'), ('Python', 'NNP'), ('Book', 'NNP'), ('!', '.')]

Because a tuple cannot be changed/updated, I managed to replace the JJ word "different" with "distinctive" by creating a new tuple with the changed value.
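Here is a minimal sketch of how I did that (the replacements dict is just an illustrative name of mine, not anything from NLTK):

# Illustrative mapping of words to swap in for JJ-tagged tokens
replacements = {'different': 'distinctive'}

# Rebuild the list, creating new tuples where the tag is JJ
new_tagged = [(replacements.get(word, word), tag) if tag == 'JJ' else (word, tag)
              for word, tag in tagged]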

Now I would like to re-create my sentence from this second list of tuples (which should still be a correct representation of a word-tagged sentence).

In short, from this:

[('And', 'CC'), ('now', 'RB'), (',', ','), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('distinctive', 'JJ'), (',', ','), ('I', 'PRP'), ('will', 'MD'), ('read', 'VB'), ('this', 'DT'), ('Python', 'NNP'), ('Book', 'NNP'), ('!', '.')]

I would like to get this: And now, for something completely distinctive, I will read this Python Book!

If I simply join the first elements of the tuples in that list, my sentence ends up with extra spaces around the punctuation, giving me this:

And now , for something completely distinctive , I will read this Python Book !
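For reference, the naive join I mean is just this (using new_tagged as the name for my edited list of tuples):

sentence = " ".join(word for word, tag in new_tagged)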

  • Does NLTK have something for constructing a sentence from its tagged form? (my searches returned nothing)
  • What other approach do you suggest to accomplish the desired result?

Please note this is not a simple synonym replacement but a custom replacement of words carrying particular tags.


ANSWERS:


If your question is about attaching the punctuation properly to the preceding word, then no, the NLTK doesn't have any facilities for this. As for extracting the words out of the tagged sentence, it is trivial to do with a comprehension (as you have presumably done already):

words = [ w for w, t in tagged ]

To combine punctuation properly, form a string and post-process it using a regexp. There's enough indeterminacy that you can't guarantee that you'll reproduce the original arrangement, but that's ok if you just want something that looks realistic.

The general rule is that punctuation attaches to the previous token, except for quote marks and opening parens (and square brackets, if they appear in your text). You also need to account for some of NLTK's special tokenization rules, as shown here:

>>> sent = "And now: \"I'll read the Python book...\" (oh no, I won't!)"
>>> nltk.word_tokenize(sent)
['And', 'now', ':', '``', 'I', "'ll", 'read', 'the', 'Python', 'book', '...', "''", 
 '(', 'oh', 'no', ',', 'I', 'wo', "n't", '!', ')']

Note how NLTK splits contractions ("I", "'ll"), makes n't a separate token, and replaces plain double quote characters with the directional quote tokens `` and ''. So here's a simple clean-up to get you started:

>>> newsent = " ".join(tokens)
>>> newsent = re.sub(r" (n't|'\w+)\b", r"\1", newsent)  # contractions
>>> newsent = re.sub(r"(\(|``) ",      r"\1", newsent)  # right-attaching punctuation
>>> newsent = re.sub(r" ([^\s\w(`])",  r"\1", newsent)  # other punctuation
>>> newsent = re.sub(r"``|''", '"', newsent)            # restore double quotes
>>> newsent
'And now: "I\'ll read the Python book..." (oh no, I won\'t!)'

