I have a dataframe with 2 columns:
+-----------+----------+
| Tweet | Language |
+-----------+----------+
| some text | en |
| more text | en |
| ein text | de |
+-----------+----------+
(The text in the Tweet column consists of actual tweets.)
I want to apply a language detection algorithm to see how many German (de) tweets I have in my df.
from langdetect import detect
nlp = detect
This works, but it only adds the tweet text to temp_list:
temp_list = [row for row in df['Tweet'] if nlp(row)=='de']
However, what I want is to add the entire row to temp_list if the language detection algorithm labels the tweet as German. I want to include both columns so I can cross-check against my Language column (which I labeled manually).
If you want the full rows and your dataframe is called df (as in the question, where nlp is the detection function), filter on the manually labeled Language column:
filtered_df = df[df['Language'] == 'de']
If you want only the Tweet column, then:
filtered_df = df[df['Language'] == 'de']['Tweet']
Finally, if you want to make a list out of those values:
df_filtered = df[df['Language'] == 'de']['Tweet'].tolist()
Output of the 1st snippet:
     Tweet Language
2  Deutsch       de
Output of the 2nd snippet:
2    Deutsch
Output of the 3rd snippet:
['Deutsch']
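You could use apply to run the detector on every tweet and compare the result with your manual labels:
df[df['Language']==df['Tweet'].apply(nlp)]
That returns a dataframe containing the full rows where the detected language matches the Language column.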
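To keep only the full rows that the detector labels as German (rather than the rows where detector and manual label agree), the same apply call can be compared against 'de' directly. The following is a minimal sketch, not a definitive implementation: rows_detected_as and toy_detect are hypothetical names introduced here, and toy_detect is a hard-coded stand-in so the example runs without langdetect installed.

```python
import pandas as pd


def rows_detected_as(df, detect_fn, lang, text_col="Tweet"):
    """Return the full rows of df whose text_col is detected as lang.

    detect_fn is any callable mapping a string to a language code,
    for example langdetect.detect (hypothetical helper, not a library API).
    """
    detected = df[text_col].apply(detect_fn)  # Series of language codes
    return df[detected == lang]               # boolean mask keeps all columns


def toy_detect(text):
    # Stand-in for a real detector (assumption for illustration only).
    return "de" if text.startswith("ein") else "en"


df = pd.DataFrame({
    "Tweet": ["some text", "more text", "ein text"],
    "Language": ["en", "en", "de"],
})

german_rows = rows_detected_as(df, toy_detect, "de")
print(german_rows)       # both columns are kept for the matching rows
print(len(german_rows))  # number of tweets detected as German
```

With the setup from the question, the call would be rows_detected_as(df, nlp, 'de'), and german_rows.values.tolist() gives a plain list of [Tweet, Language] rows if a list is really needed.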
You could also store the detector's result in a new column, e.g. detected_lang:
df['detected_lang']=df['Tweet'].apply(nlp)
print(df)
       Tweet  Language  detected_lang
0  some text        en             sv
1  more text        en             en
2   ein text        de             de
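Once detected_lang exists, counting the German tweets and listing the rows where the detector disagrees with the manual label are both one-line comparisons. A minimal sketch under stated assumptions: add_detected_lang and lookup_detect are hypothetical names, and lookup_detect simply replays the labels printed above instead of calling langdetect, so the example is reproducible.

```python
import pandas as pd


def add_detected_lang(df, detect_fn, text_col="Tweet"):
    """Return a copy of df with a detected_lang column (hypothetical helper)."""
    out = df.copy()
    out["detected_lang"] = out[text_col].apply(detect_fn)
    return out


# Stand-in detector that replays the labels shown in the printed frame
# (assumption for illustration; in practice pass langdetect.detect).
_LABELS = {"some text": "sv", "more text": "en", "ein text": "de"}


def lookup_detect(text):
    return _LABELS[text]


df = pd.DataFrame({
    "Tweet": ["some text", "more text", "ein text"],
    "Language": ["en", "en", "de"],
})

checked = add_detected_lang(df, lookup_detect)

# How many tweets did the detector label as German?
n_german = int((checked["detected_lang"] == "de").sum())

# Rows where the detector disagrees with the manual label.
mismatches = checked[checked["detected_lang"] != checked["Language"]]

print(n_german)
print(mismatches)
```

Short texts such as "some text" are easy for a detector to mislabel, so the mismatches frame is a convenient place to start the manual cross-check.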