Convert a list of 'utf-8' encoded strings into regular strings

I'm using the tweepy library to get a list of tweet texts. I need to compare the words in 200 tweets against a list of stop words and remove the stop words from the tweet text list, so I can report which words appear most often in the tweets searched.

The problem is that when I retrieve tweet.text I have to encode it, so I end up with a list of b'word' items, which cannot be compared with my list of stop words because those are regular strings.

def get_tweetwords():
    print("Ingrese el hashtag a buscar, no olvide escribir el numeral (#)(Ctrl+3)")
    hashtag=str(input())
    while (hashtag[0])!="#":
        print("No olvide escribir el numeral. Vuelva a escribir el hashtag:")
        hashtag=str(input())
    busqueda=tweepy.Cursor(api.search,q=hashtag).items(10)
    twlist=[]
    listadepalabras=[]
    for tweets in busqueda:
        twlist.append(tweets.text.encode('utf-8'))
    for i in twlist:
        x=i.split()
    for j in x:
        listadepalabras.append(j)
    return(listadepalabras)

I need to decode the listadepalabras list into a list of strings in order to compare it with the stop words list and remove the stop words from it.

def get_stopwords():   
    listastopwords=[item for item in open("stopwords.txt").readlines()]
    for item in listastopwords:
        if "\n" in listastopwords:
           listastopwords[listastopwords.index(item)]=item.replace("\n","")
    return(listastopwords)

def sacar_stopwords():
    listadepalabras=get_tweetwords()
    listastopwords=get_stopwords()
    for i in listadepalabras:
        for j in listastopwords:
            if i==j:
                listadepalabras.remove(j)
    return(listadepalabras)

This is not working, since my text list contains words in b'word' format while my stop words list contains plain 'word' strings.

def repeticiones_palabra():
    listadepalabras=sacar_stopwords()
    diccionario=collections.Counter(listadepalabras)
    diccionario=dict(diccionario.most_common(10))
    print ("-LAS 10 PALABRAS MAS UTILIZADAS SON-")
    print(diccionario)

This should give me the 10 most used words in my list. It runs fine, but I'm getting mostly stop words plus the hashtag I searched for, so I can tell the stop words are not being removed from the list.

repeticiones_palabra()

I hope I made myself clear. I'm a beginner at Python and at Stack Overflow as well. Thanks in advance.

b'some string' is a bytes object (a byte string).

To convert it to a regular string, use the .decode() method of the bytes object:

lst = [b'abc', b'def']
# >>> [b'abc', b'def']
lst2 = [s.decode("utf-8") for s in lst]
# >>> ['abc', 'def']
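
As a minimal sketch of why the comparison in the question fails: in Python 3 a bytes object never compares equal to a str, even when the characters match. The stop words and words below are made up for illustration.

```python
stopwords = ["the", "a"]   # hypothetical stop words, regular str
words = [b"the", b"cat"]   # bytes, as produced by .encode('utf-8') followed by .split()

# bytes vs str: always unequal, so no stop word is ever matched
bytes_match = words[0] == stopwords[0]

# after decoding, both sides are str and the comparison works
decoded = [w.decode("utf-8") for w in words]
str_match = decoded[0] == stopwords[0]

kept = [w for w in decoded if w not in stopwords]
print(bytes_match, str_match, kept)
```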

EDIT: I see that you are encoding the strings yourself on these lines:

for tweets in busqueda:
    twlist.append(tweets.text.encode('utf-8'))

Why do you encode and then want to decode?

Just change it to:

for tweets in busqueda:
    twlist.append(tweets.text)
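
Once the texts are plain strings, two other spots in the question's code may still keep stop words in the result: `if "\n" in listastopwords` tests the list rather than each item, so the trailing newlines are probably never stripped, and calling `remove()` on a list while iterating over it can skip elements. A possible sketch that avoids both is below. The helper names `clean_stopwords` and `top_words`, the lowercasing step, and the sample data are assumptions for illustration, not part of the original code.

```python
import collections

def clean_stopwords(lines):
    # strip() drops the trailing "\n" that readlines() leaves on each line
    return {line.strip().lower() for line in lines if line.strip()}

def top_words(texts, stopwords, n=10):
    words = []
    for text in texts:
        # collect the words of every text, not only the last one
        words.extend(text.lower().split())
    # build a new list instead of removing items while iterating
    kept = [w for w in words if w not in stopwords]
    return collections.Counter(kept).most_common(n)

# Made-up data standing in for the tweet texts and stopwords.txt
texts = ["the cat and the dog", "the cat sleeps"]
stopwords = clean_stopwords(["the\n", "and\n"])
result = top_words(texts, stopwords, n=2)
print(result)
```

With a real file, the lines could come from `open("stopwords.txt")` as in the question. Lowercasing both sides is a design choice so that "The" and "the" count as the same word.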
