I'm using the tweepy library to get a list of tweet texts. I need to compare the words in 200 tweets against a list of stop words and remove the stop words from the tweet word list, so that I can report which words appear most often in the tweets I searched for.
The problem is that when I retrieve tweet.text I have to encode it, so I end up with a list of b'word' values, which cannot be compared with my list of stop words because those are regular strings.
def get_tweetwords():
    # Prompt (Spanish): "Enter the hashtag to search for, don't forget the number sign (#)"
    print("Ingrese el hashtag a buscar, no olvide escribir el numeral (#)(Ctrl+3)")
    hashtag=str(input())
    while (hashtag[0])!="#":
        # Prompt (Spanish): "Don't forget the number sign. Type the hashtag again:"
        print("No olvide escribir el numeral. Vuelva a escribir el hashtag:")
        hashtag=str(input())
    busqueda=tweepy.Cursor(api.search,q=hashtag).items(10)
    twlist=[]
    listadepalabras=[]
    for tweets in busqueda:
        twlist.append(tweets.text.encode('utf-8'))
    for i in twlist:
        x=i.split()
        for j in x:
            listadepalabras.append(j)
    return(listadepalabras)
I need to decode the listadepalabras list into a list of strings so that I can compare it with the stopwords list and remove the stop words from it.
def get_stopwords():
    listastopwords=[item for item in open("stopwords.txt").readlines()]
    for item in listastopwords:
        if "\n" in listastopwords:
            listastopwords[listastopwords.index(item)]=item.replace("\n","")
    return(listastopwords)
def sacar_stopwords():
    listadepalabras=get_tweetwords()
    listastopwords=get_stopwords()
    for i in listadepalabras:
        for j in listastopwords:
            if i==j:
                listadepalabras.remove(j)
    return(listadepalabras)
This is not working because my word list contains items in b'word' format, while my stopwords list contains plain 'word' strings.
def repeticiones_palabra():
    listadepalabras=sacar_stopwords()
    diccionario=collections.Counter(listadepalabras)
    diccionario=dict(diccionario.most_common(10))
    # Heading (Spanish): "THE 10 MOST USED WORDS ARE"
    print ("-LAS 10 PALABRAS MAS UTILIZADAS SON-")
    print(diccionario)
This should give me the 10 most used words in my list, and that part works. However, the result is mostly stop words plus the hashtag I searched for, so I can tell the stop words are not being removed from the list.
repeticiones_palabra()
I hope I made myself clear. I'm a beginner with Python and with Stack Overflow as well. Thanks in advance.
b'some string' is a bytes object (a byte string), not a regular string.
To convert it to a regular string (str), call the bytes object's built-in .decode() method:
lst = [b'abc', b'def']
# >>> [b'abc', b'def']
lst2 = [s.decode("utf-8") for s in lst]
# >>> ['abc', 'def']
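To see why the comparison in the question never matches, here is a small sketch. The sample words are made up for illustration. In Python 3 a bytes object never compares equal to a str, even when the characters look the same, so decoding first is what makes the match possible.

```python
# Sample data invented for illustration; not taken from real tweets.
words = [b'the', b'python', b'ma\xc3\xb1ana']
stopwords = ['the', 'a']

# bytes and str never compare equal in Python 3.
print(words[0] == stopwords[0])    # False

# After decoding, the same word matches the stop word.
decoded = [w.decode("utf-8") for w in words]
print(decoded[0] == stopwords[0])  # True
print(decoded[2])                  # mañana
```

The last line also shows that decoding with "utf-8" restores non-ASCII characters, which matters for Spanish tweets.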
EDIT: I see that you are encoding the strings yourself in this loop:
for tweets in busqueda:
    twlist.append(tweets.text.encode('utf-8'))
Why encode them if you then need to decode them again? tweets.text is already a regular string.
Just change the loop to:
for tweets in busqueda:
    twlist.append(tweets.text)
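With the text kept as str, the stop words can be filtered directly. Below is a minimal sketch of that step, not your exact code: remove_stopwords is a hypothetical helper, the sample tweets and stop words are invented, and lowercasing before the comparison is my own assumption. It builds a new list instead of calling remove() inside the loop, because removing items from a list while iterating over it can skip elements.

```python
import collections

def remove_stopwords(texts, stopwords):
    """Split each text into words and drop the ones found in stopwords."""
    # strip() also removes the trailing "\n" that readlines() leaves on each line.
    # Lowercasing is an assumption so that "The" and "the" are treated alike.
    stopset = {s.strip().lower() for s in stopwords}
    words = []
    for text in texts:
        for word in text.split():
            if word.lower() not in stopset:
                words.append(word)
    return words

# Made-up input standing in for the tweet texts and the lines of stopwords.txt.
tweets = ["the cat and the hat", "a cat in the rain"]
stopwords = ["the\n", "a\n", "and\n", "in\n"]

words = remove_stopwords(tweets, stopwords)
print(words)                                      # ['cat', 'hat', 'cat', 'rain']
print(collections.Counter(words).most_common(1))  # [('cat', 2)]
```

Using a set for the stop words also avoids the nested loop, since membership tests on a set do not scan the whole list.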