
How do I discover a list of words from one corpus that distinguishes it from another corpus? (Python)

I have two lists of unstructured text input, and I want to find the words that distinguish listA from listB. For example, if listA were the text of "Harry Potter" and listB were the text of "Ender's Game", the distinguishing elements for listA would be [wand, magic, wizard, . . .] and the distinguishing elements for listB would be [ender, buggers, battle, . . .].

I've tried a bit with the python-nltk module, and am able to easily find the most common words in each list, but that is not exactly what I'm after.

You can use synsets to get this done. For synsets, NLTK includes a very powerful resource called WordNet.

WordNet is a big 'database' (for lack of a better word) of human language. It is not limited to English; it supports many other languages.

A synset is something like the cluster of related ideas you get when you hear a term. It is almost like a set of synonyms, but not that strict. See the WordNet documentation for a proper definition.
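For example, a minimal look-up (the word 'wand' is just an illustrative pick from the question; you may need to run nltk.download('wordnet') once):

from nltk.corpus import wordnet as wn

# Each synset groups the word senses that share one meaning
for syn in wn.synsets('wand'):
    print(syn.name(), '-', syn.definition())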

Synset closures are what can help you the most. For example, a 'bee' is an animal, an insect, a living thing; Harry Potter is fictional, a human, a wizard.

from nltk.corpus import wordnet as wn

# Take the first noun sense of 'dog' and walk up its hypernym
# ('is-a') chain to reach increasingly general concepts
dog = wn.synset('dog.n.01')
hyper = lambda s: s.hypernyms()
list(dog.closure(hyper))
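Here is a sketch of how these closures could be turned toward the original question; hypernym_labels is my own hypothetical helper, not an NLTK function:

from nltk.corpus import wordnet as wn

def hypernym_labels(word):
    # Union of the hypernym closures over every sense of the word;
    # the exact labels depend on the WordNet hierarchy
    labels = set()
    for syn in wn.synsets(word):
        labels.update(h.name() for h in syn.closure(lambda s: s.hypernyms()))
    return labels

Comparing these label sets for the frequent words of each list could surface the themes that set the lists apart, like 'magic' versus 'battle'.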

Here's a book that teaches you the surface of NLTK. It is not very deep, but it is a good place to start, along with the NLTK HOWTOs.

If you want something deeper I can't help you; I don't know most of the definitions and functions NLTK provides us, but synsets are a great place to start.

"I've tried a bit with the python-nltk module, and am able to easily find the most common words in each list, but that is not exactly what I'm after."

I'm guessing what you mean by this is that it's coming up with words like "and", "the", "of", etc. as the words with the highest frequency. These words aren't very helpful; they are basically just the glue that holds other words together to form a sentence. You could remove them, but you would need a list of "useless" words, called a stoplist. NLTK has such a list: from nltk.corpus import stopwords.
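A minimal sketch of filtering with that stoplist (list_a here is a toy stand-in for one of your tokenized input lists; you may need to run nltk.download('stopwords') once):

from collections import Counter
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))

def content_words(tokens):
    # Keep only alphabetic tokens that are not glue words
    return [t.lower() for t in tokens if t.isalpha() and t.lower() not in stop]

list_a = "the boy waved his wand and the magic began".split()
# Frequency counts now rank content words instead of "and", "the", "of"
print(Counter(content_words(list_a)).most_common(10))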

You might want to take a look at TF-IDF scoring. It gives a higher weight to words that are common in one document but uncommon in general. Usually you would use a large background corpus to calculate which words are common in general.
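A minimal TF-IDF sketch in plain Python; the function name, the smoothing, and treating listB (plus any other reference texts) as the background corpus are my own assumptions, not from a particular library:

import math
from collections import Counter

def tf_idf_scores(target_tokens, background_docs):
    # Term frequency in the target, discounted by document frequency
    # across the background documents (each a list of tokens)
    tf = Counter(target_tokens)
    df = Counter()
    for doc in background_docs:
        df.update(set(doc))
    n = len(background_docs)
    return sorted(
        ((w, c * (math.log((1 + n) / (1 + df[w])) + 1)) for w, c in tf.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )

# The highest-scoring words for listA against listB should be
# its distinguishing words, e.g. [wand, magic, wizard, . . .]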
