Removing non-English words from a sentence in Python

Question

I have written code that sends queries to Google and returns the results. I extract the snippets (summaries) from these results for further processing. However, the snippets sometimes contain non-English words, which I don't want. For example:

/\u02b0w\u025bn w\u025bn unstressed \u02b0w\u0259n w\u0259n/

I only want the word "unstressed" from this sentence. How can I do that? Thanks.

Answer 1

PyEnchant might be a simple option for you. I do not know about its speed, but you can do things like:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>>

PyEnchant has a tutorial, and the library can also return suggestions, which you could use for another query. In addition, you can check whether your result is Latin-1 (is_utf8() exists; I do not know whether an is_latin-1() does too), or use something like Enca, which detects the encoding of text files based on knowledge of their language.
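To make the two ideas above concrete, here is a minimal sketch that first drops tokens containing non-ASCII characters and then asks a spell checker about the rest. The helper names (is_ascii, keep_english, make_enchant_checker) are made up for illustration, and the checker is passed in as a plain function so the sketch also runs without PyEnchant installed.

```python
import re


def is_ascii(word):
    """Return True if every character in the word is plain ASCII."""
    try:
        word.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


def keep_english(sentence, is_word):
    """Keep the ASCII tokens that the supplied checker accepts.

    is_word is any callable taking a string and returning a bool,
    for example the check method of an enchant.Dict.
    """
    # Runs of letters only (Unicode-aware); digits, underscores and
    # punctuation such as "/" act as separators.
    tokens = re.findall(r"[^\W\d_]+", sentence)
    return [t for t in tokens if is_ascii(t) and is_word(t)]


def make_enchant_checker(tag="en_US"):
    """Hypothetical helper: build a checker from PyEnchant (assumed installed)."""
    import enchant
    return enchant.Dict(tag).check


if __name__ == "__main__":
    sample = "/\u02b0w\u025bn w\u025bn unstressed \u02b0w\u0259n w\u0259n/"
    # Stand-in vocabulary so the demo does not depend on PyEnchant.
    demo_vocab = {"unstressed"}
    print(keep_english(sample, lambda w: w.lower() in demo_vocab))
    # prints ['unstressed']
```

With PyEnchant available, passing make_enchant_checker() as the second argument should give the dictionary-backed behaviour. For the example in the question, the ASCII test alone already discards the phonetic tokens; the spell check matters for foreign words written in plain Latin letters.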

Answer 2

You can compare the words you receive with a dictionary of English words, for example /usr/share/dict/words on a BSD system.

I would guess that Google's results are grammatically correct for the most part, but if not, you might have to look into stemming in order to match against your dictionary.
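A rough sketch of this word-list approach follows. The function names are invented for illustration, the word list is assumed to be one word per line in UTF-8, and naive_stems is only a crude suffix stripper standing in for a real stemmer, so treat it as a starting point rather than a finished solution.

```python
import re


def load_words(path="/usr/share/dict/words"):
    """Read a one-word-per-line list into a lowercase set."""
    with open(path, encoding="utf-8", errors="ignore") as fh:
        return {line.strip().lower() for line in fh if line.strip()}


def naive_stems(word):
    """Yield the word itself, then crude suffix-stripped variants.

    This is NOT a real stemmer; it only tries a few common endings.
    """
    yield word
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            yield word[: -len(suffix)]


def filter_by_wordlist(sentence, vocabulary):
    """Keep tokens whose lowercase form, or a crude stem of it, is listed."""
    tokens = re.findall(r"[^\W\d_]+", sentence)
    return [
        t for t in tokens
        if any(stem in vocabulary for stem in naive_stems(t.lower()))
    ]


if __name__ == "__main__":
    # Small in-memory vocabulary; use load_words() for the system list.
    vocab = {"unstressed", "word"}
    print(filter_by_wordlist("w\u025bn unstressed words", vocab))
    # prints ['unstressed', 'words']
```

Loading the list into a set keeps each lookup cheap, which matters if many snippets are processed.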

Answer 3

You can use PyWordNet, a Python interface to WordNet. Just split your sentence on whitespace and check whether each word is in the dictionary.
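The split-and-look-up idea might be sketched as below. Note that this swaps in NLTK's WordNet reader rather than PyWordNet itself, and it assumes nltk is installed and the "wordnet" corpus has been downloaded. The function names are hypothetical, and the lookup is passed in as a callable so the filtering logic can be tried without NLTK.

```python
def make_wordnet_checker():
    """Return a predicate backed by NLTK's WordNet corpus.

    Assumption: nltk is installed and nltk.download("wordnet") has been run.
    """
    from nltk.corpus import wordnet
    return lambda word: bool(wordnet.synsets(word))


def filter_with_lookup(sentence, has_entry):
    """Split on whitespace and keep the words the lookup recognizes."""
    kept = []
    for raw in sentence.split():
        # Trim punctuation stuck to the word, such as the slashes
        # around the example in the question.
        word = raw.strip(".,;:!?\"'()/")
        if word and has_entry(word.lower()):
            kept.append(word)
    return kept


if __name__ == "__main__":
    sample = "/\u02b0w\u025bn w\u025bn unstressed \u02b0w\u0259n w\u0259n/"
    # Stand-in lookup; replace with make_wordnet_checker() when NLTK is set up.
    print(filter_with_lookup(sample, lambda w: w == "unstressed"))
    # prints ['unstressed']
```

One caveat: WordNet only covers nouns, verbs, adjectives and adverbs, so function words such as "the" or "and" will be rejected unless you allow them separately.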

3 answers

solution1
3 2010-10-27 09:23:44

solution2
1 2010-10-27 09:15:52

solution3
1 2010-10-27 09:20:55
