How to count the top 150 words and remove common words from 2 lists?

The code below finds the 150 most frequent words in each of two strings, p and n.

pwords = re.findall(r'\w+',p)
ptop150words=Counter(pwords).most_common(150)
sorted(ptop150words)

nwords = re.findall(r'\w+',n)
ntop150words=Counter(nwords).most_common(150)
sorted(ntop150words)

The code below is meant to remove the words that appear in both lists.

def new(ntopwords,ptopwords):
    for i in ntopwords[:]:
        if i in ptopwords:
            ntopwords.remove(i)
            ptopwords.remove(i)
print(i)

However, print(i) produces no output. What is wrong?

Most likely your indentation. print(i) has to sit inside the if block:

def new(negativetop150words, positivetop150words):
    for i in negativetop150words[:]:
        if i in positivetop150words:
            negativetop150words.remove(i)
            positivetop150words.remove(i)
            print(i)
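
The [:] in the loop header matters too: it iterates over a copy, so calling remove on the original list does not make the loop skip elements. Here is a minimal sketch with made-up lists; the names remove_common, left and right are hypothetical and not from the question:

```python
def remove_common(left, right):
    """Remove items found in both lists, in place; return what was removed."""
    removed = []
    for item in left[:]:      # iterate over a copy, mutate the original
        if item in right:
            left.remove(item)
            right.remove(item)
            removed.append(item)
    return removed

left = ["a", "b", "c", "d"]
right = ["b", "c", "x"]
removed = remove_common(left, right)
print(removed)  # ['b', 'c']
print(left)     # ['a', 'd']
print(right)    # ['x']
```

Without the copy, removing "b" would shift "c" into the slot the loop has just visited, so "c" would be skipped and stay in both lists.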

The code you posted never calls the function new(negativetop150words, positivetop150words). Also, per Jesse's comment, print(i) is outside the function. Here's the code that worked for me:

import re
from collections import Counter

def new(negativetop150words, positivetop150words):
    for i in negativetop150words[:]:
        if i in positivetop150words:
            negativetop150words.remove(i)
            positivetop150words.remove(i)
            print(i)

    return negativetop150words, positivetop150words

positive = 'The FDA is already fairly gung-ho about providing this. It receives about 1,000 applications a year and approves all but 1%. The agency makes sure there is sound science behind the request, and no obvious indication that the medicine would harm the patient.'
negative = 'Thankfully these irritating bits of bureaucracy have been duly dispatched. This victory comes courtesy of campaigning work by a libertarian think-tank, the Goldwater Institute, based in Arizona. It has been pushing right-to-try legislation for around four years, and it can now be found in 40 states. Speaking about the impact of these laws on patients, Arthur Caplan, a professor of bioethics at NYU School of Medicine in New York, says he can think of one person who may have been helped.'

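# Split each text into words and count the 150 most frequent ones.
positivewords = re.findall(r'\w+', positive)
positivetop150words = Counter(positivewords).most_common(150)
sorted(positivetop150words)  # no effect: the sorted copy is discarded

negativewords = re.findall(r'\w+', negative)
negativetop150words = Counter(negativewords).most_common(150)

# Note: this passes the raw word lists, not the top-150 lists,
# so a repeated word such as "the" can be printed more than once.
words = new(negativewords, positivewords)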

This prints:

a
the
It
and
about
the
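
One thing to watch if you pass the top-150 lists rather than the raw word lists: Counter.most_common returns (word, count) tuples, so a test like `i in positivetop150words` only matches when the word and its count are both equal. Below is a rough sketch that compares by word only and keeps the counts. The helpers top_words and drop_shared and the two short strings are made up for illustration:

```python
import re
from collections import Counter

def top_words(text, n=150):
    """Return the n most frequent words as (word, count) tuples."""
    return Counter(re.findall(r"\w+", text)).most_common(n)

def drop_shared(top_a, top_b):
    """Return copies of both lists without the words that occur in both."""
    shared = {w for w, _ in top_a} & {w for w, _ in top_b}
    only_a = [(w, c) for w, c in top_a if w not in shared]
    only_b = [(w, c) for w, c in top_b if w not in shared]
    return only_a, only_b, shared

pos_top = top_words("good good film and a good film cast")
neg_top = top_words("bad film and a bad bad plot")
pos_only, neg_only, shared = drop_shared(pos_top, neg_top)
print(pos_only)        # [('good', 3), ('cast', 1)]
print(neg_only)        # [('bad', 3), ('plot', 1)]
print(sorted(shared))  # ['a', 'and', 'film']
```

Here "film" appears twice in one string and once in the other, so the tuples ('film', 2) and ('film', 1) differ and a tuple-level comparison would leave the word in both lists.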

You could rely on set operations. Once you have both word lists, convert them to sets. The common words are the intersection of the two sets, and subtracting that intersection from each original set leaves the words unique to each:

positiveset = set(positivewords)
negativeset = set(negativewords)
commons = positiveset & negativeset
positivewords = sorted(positiveset - commons)
negativewords = sorted(negativeset - commons)
commonwords = sorted(commons)
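
As a runnable sketch of the set approach, here it is applied to two short made-up strings (the strings and the words helper are assumptions for illustration). Keep in mind that converting to sets drops duplicates and all frequency information, so this fits best when you only need the words themselves:

```python
import re

def words(text):
    """Split text into word tokens, as in the snippets above."""
    return re.findall(r"\w+", text)

positiveset = set(words("the agency makes sure there is sound science"))
negativeset = set(words("the institute is based in Arizona"))

commons = positiveset & negativeset            # words present in both
positive_only = sorted(positiveset - commons)  # unique to the first text
negative_only = sorted(negativeset - commons)  # unique to the second text

print(sorted(commons))  # ['is', 'the']
print(positive_only)    # ['agency', 'makes', 'science', 'sound', 'sure', 'there']
print(negative_only)    # ['Arizona', 'based', 'in', 'institute']
```

The comparison is case-sensitive, so "The" and "the" would count as different words unless the text is lowercased first.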
