
Remove stopwords in Python

I'm developing an algorithm to remove stopwords. I read a txt file into a list and pass that list to the removal function.

Example of file lines:

'mora vai nascer viver cair falar','positivo'
'deixa ver entendi vai crescer vai passar ve','positivo'
'so deveria ter foi agradeco de passei passei fez','positivo'
'nunca nao nao muito nao mais','negativo'
'a nao ate infelizmente ai ate quando','negativo'
'nao perto nao quanto menos nao sim nao nem simplesmente','negativo'

Code

with open('BasePalavras.txt') as arquivo:
     baseTeste = [linha.strip() for linha in arquivo]


stopwords = ['a', 'agora', 'algum', 'alguma', 'aquele', 'aqueles', 'de', 'deu', 'do', 'e', 'estou', 'esta', 'esta',
         'ir', 'meu', 'muito', 'mesmo', 'no', 'nossa', 'o', 'outro', 'para', 'que', 'sem', 'talvez', 'tem', 'tendo',
         'tenha', 'teve', 'tive', 'todo', 'um', 'uma', 'umas', 'uns', 'vou']

def removestopword(texto):
     frases=[]
     for(palavras, emocao) in texto:
         semstopwords = [p for p in palavras.splits() if p not in stopwords]
         frases.append((semstopwords, emocao))
     return frases

print (removestopword(baseTeste))

ERROR

Traceback (most recent call last):
     File "C:/Users/Rivaldo/PycharmProjects/Mineracao/Principal.py", line 22, in <module>
          print (removestopword(baseTeste))
     File "C:/Users/Rivaldo/PycharmProjects/Mineracao/Principal.py", line 17, in removestopword
          for(palavras, emocao) in texto:
   ValueError: too many values to unpack

Try this:

with open('BasePalavras.txt') as arquivo:
    baseTeste = [linha.strip().split(',') for linha in arquivo]


stopwords = ['a', 'agora', 'algum', 'alguma', 'aquele', 'aqueles', 'de', 'deu', 'do', 'e', 'estou', 'esta', 'esta',
         'ir', 'meu', 'muito', 'mesmo', 'no', 'nossa', 'o', 'outro', 'para', 'que', 'sem', 'talvez', 'tem', 'tendo',
         'tenha', 'teve', 'tive', 'todo', 'um', 'uma', 'umas', 'uns', 'vou']

def removestopword(texto):
    frases=[]
    for (palavras, emocao) in texto:
        semstopwords = [p for p in palavras.split() if p not in stopwords]
        frases.append((semstopwords, emocao))
    return frases

print (removestopword(baseTeste))

Two things changed. First, baseTeste = [linha.strip() for linha in arquivo] became baseTeste = [linha.strip().split(',') for linha in arquivo], so each line is split into a [sentence, sentiment] pair that the for loop can unpack. Second, palavras.splits() became palavras.split(), because str has no splits() method.
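One caveat, offered as a sketch rather than as part of the answer above: split(',') keeps the single quotes around each field, so the first word of a line like 'a nao ate ...' arrives as 'a (quote included) and slips past the stopword check. Assuming the file always uses single-quoted, comma-separated fields as in the example lines, the standard csv module with quotechar="'" can parse them. The helper parse_lines, the shortened STOPWORDS set, and the sample data are hypothetical names used only for illustration.

```python
import csv

# Shortened stopword set, for illustration only; use the full list in practice.
STOPWORDS = {'a', 'de', 'muito', 'o', 'que'}

def parse_lines(lines):
    """Turn "'sentence','label'" lines into (sentence, label) tuples."""
    pairs = []
    for row in csv.reader(lines, quotechar="'"):
        if len(row) != 2:  # skip blank or malformed lines
            continue
        pairs.append((row[0], row[1]))
    return pairs

def removestopword(texto):
    frases = []
    for palavras, emocao in texto:
        semstopwords = [p for p in palavras.split() if p not in STOPWORDS]
        frases.append((semstopwords, emocao))
    return frases

# In-memory stand-in for the lines of BasePalavras.txt.
sample = [
    "'a nao ate infelizmente ai ate quando','negativo'\n",
    "\n",
    "'nunca nao nao muito nao mais','negativo'\n",
]
result = removestopword(parse_lines(sample))
print(result)
```

An open file object can be passed to parse_lines in place of sample, since csv.reader accepts any iterable of lines.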

Here's how I would do it.

stopwords = ['a', 'agora', 'algum', 'alguma', 'aquele', 'aqueles', 'de', 'deu', 'do', 'e', 'estou', 'esta', 'esta',
     'ir', 'meu', 'muito', 'mesmo', 'no', 'nossa', 'o', 'outro', 'para', 'que', 'sem', 'talvez', 'tem', 'tendo',
     'tenha', 'teve', 'tive', 'todo', 'um', 'uma', 'umas', 'uns', 'vou']

def remove_stopwords(text):
    phrases = []
    for (sentence, _) in text:
        sentence_without_stopwords = [word for word in sentence.split() if word not in stopwords]
        phrases.append(sentence_without_stopwords)
    return phrases

with open('input.txt') as raw_text:
    sentence_sentiments = []
    for line in raw_text:
        line = line.strip()  # drop the trailing newline so [1:-1] removes only the quotes
        if not line:
            continue  # skip blank lines
        sentence, sentiment = line.split(',')
        sentence_sentiments.append((sentence[1:-1], sentiment[1:-1]))
    print(remove_stopwords(sentence_sentiments))

Notice that, in your code, baseTeste is a list of strings, one per line of the input file. That is not what you want, because the loop ( for(palavras, emocao) in texto: ) tries to unpack each item into a (sentence, sentiment) pair. The missing middle step is splitting each line into such a pair.
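To make the failure concrete, here is a minimal sketch that reproduces the unpacking error on an in-memory list (the variable names lines, pairs, and error are made up for the demo) and then shows the same loop working once each line has been split:

```python
lines = ["'nunca nao','negativo'"]

# Iterating over raw lines: each item is a single string, and unpacking a
# string into two names only works if it has exactly two characters.
try:
    for palavras, emocao in lines:
        pass
    error = None
except ValueError as exc:
    error = str(exc)
print(error)

# After splitting on the comma, each item is a two-element list,
# so the same loop unpacks cleanly.
pairs = [line.split(',') for line in lines]
for palavras, emocao in pairs:
    print(palavras, emocao)
```

Note that the split fields still carry their surrounding quotes here, which is why the code above slices them off with [1:-1].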

