As a linguist and a Python beginner, I want to find word collocations in my own (German) tweet corpus. How can I convert the tweets from a pandas DataFrame (just one column = tweet) into a list of words, so that I can use the NLTK collocation finder? My version (below) creates a list of letters rather than a list of words, so I only get letter collocations. Any advice would be great!
This is what I have so far:
import pandas as pd
import regex as re
import nltk
from textblob_de import TextBlobDE as TextBlob

data = pd.read_csv("tweets.csv")
df = pd.DataFrame(data)

def cleaningTweets(twt):
    twt = re.sub(r'@[A-ZÜÄÖa-züäöß0-9]+', '', twt)  # remove @mentions
    twt = re.sub(r'#', '', twt)                     # remove hash signs
    twt = re.sub(r'https?://\S+', '', twt)          # remove URLs
    return twt

df.tweet = df.tweet.apply(cleaningTweets)
df.tweet = df.tweet.str.lower()

df["tweet_tok"] = df["tweet"].apply(lambda x: " ".join(TextBlob(x).words))
all_words = ' '.join([text for text in df.tweet_tok])
tweettext = nltk.Text(all_words)
If all you are after is a list of words from a sentence, I think you are looking for the .split method on a Python string object. Pandas has a built-in method to apply string splitting to each row in a DataFrame (or Series), and to expand the result out to individual columns if you need it.
For example, try this little piece of code and see if it does what you want:
import pandas as pd

strings_to_split = [
    "i like to be beside the sea",
    "me too"
]
pd.Series(strings_to_split).str.split(expand=True)
A couple of notes:
- .split() splits on whitespace by default, but you can pass any character to split on instead, e.g. .split('a')
- pass expand=False to keep the list of words in each row instead of expanding it out to columns
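To tie this back to your collocation goal: the key fix is that the NLTK finder needs a flat list of word tokens, not one joined string (iterating a string yields characters, which is why you got letter collocations). Here is a minimal sketch, assuming a tokenized column named tweet_tok as in your code, with two made-up German tweets standing in for your corpus:

```python
import pandas as pd
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

# Stand-in for your DataFrame; in your case df["tweet_tok"] already exists
df = pd.DataFrame({"tweet_tok": ["ich mag den strand", "ich mag das meer"]})

# One flat list of word tokens across all tweets
# (a list of words, not one long string of characters)
all_words = [word for tweet in df["tweet_tok"] for word in tweet.split()]

# Build the finder from the word list and rank bigrams, e.g. by PMI
finder = BigramCollocationFinder.from_words(all_words)
top_bigrams = finder.nbest(BigramAssocMeasures().pmi, 3)
print(top_bigrams)
```

Note that from_words treats the corpus as one continuous stream, so bigrams can span tweet boundaries; if you want to avoid that, BigramCollocationFinder.from_documents(df["tweet_tok"].str.split()) builds the finder per tweet instead.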