
How to convert a pandas DataFrame into a list of words for the NLTK collocation finder?

As a linguist and a Python beginner, I want to find word collocations in my own (German) tweet corpus. How can I convert the tweets from a pandas DataFrame (just one column = tweet) into a list of words, so that I can then use the NLTK collocation finder? My version (below) creates a list of letters rather than a list of words, and therefore only gives me letter collocations. Any advice would be great!

This is what I have so far:

import pandas as pd
data = pd.read_csv("tweets.csv")

import regex as re
def cleaningTweets(twt):
    twt = re.sub(r'@[A-ZÜÄÖa-züäöß0-9]+', '', twt)  # remove @mentions
    twt = re.sub(r'#', '', twt)                     # drop the # sign, keep the word
    twt = re.sub(r'https?://\S+', '', twt)          # remove URLs
    return twt

df = pd.DataFrame(data)

df.tweet = df.tweet.apply(cleaningTweets)
df.tweet = df.tweet.str.lower()

from textblob_de import TextBlobDE as TextBlob
df["tweet_tok"] = df["tweet"].apply(lambda x: " ".join(TextBlob(x).words))

import nltk
all_words = ' '.join([text for text in df.tweet_tok])
tweettext = nltk.Text(all_words)

If all you are after is a list of words from a sentence, you are looking for the .split method on a Python string object. pandas has a built-in method to apply string splitting to each row of a DataFrame (or Series) and, if you need it, to expand the result out into individual columns.

For example, try this little piece of code and see if it does what you want:

import pandas as pd
strings_to_split = [
    "i like to be beside the sea",
    "me too"
]
pd.Series(strings_to_split).str.split(expand=True)

A couple of notes:

  • Simply calling .split() splits on whitespace, but you can pass any character to split on instead, e.g. .split('a')
  • Per the question in the comments below, pass expand=False to keep a list of words in each row instead of expanding out to columns
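To connect this back to the original goal: once you have word tokens instead of a single joined string, you can hand them straight to NLTK's collocation finder. A minimal sketch, assuming nltk is installed (the example tweets and the PMI scoring measure are my own stand-ins, not from the question):

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Stand-in for the cleaned tweet column from the question
tweets = [
    "ich mag den strand am meer",
    "am meer ist es schön",
]

# Flatten all tweets into one list of word tokens (not letters)
words = [w for twt in tweets for w in twt.split()]

# Feed the word list (not a joined string) to the finder
finder = BigramCollocationFinder.from_words(words)
bigram_measures = BigramAssocMeasures()
print(finder.nbest(bigram_measures.pmi, 3))
```

The key difference from the code in the question is that from_words receives a list of words; passing a single string makes NLTK iterate over its characters, which is exactly the letter-collocation symptom described above.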
