Connect a dataframe into a function in Python
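Sorry, this is a very basic question, but I am completely new to Python (I have only used R before, since that is what I was taught at university, and admittedly not to a very high level), so I don't know how to do this.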
I am performing sentiment analysis on tweets, and found a pre-trained sentiment analysis package (RoBERTa) which runs on Python - I have aggregated and cleaned all my data in R, and now have a CSV with a column with the cleaned tweets.
This is the code I am using:
! pip install transformers
! pip install scipy
import pandas as pd
import io
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax
from google.colab import files
uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded['example_cleaned_tweets.csv']))
print(df)
tweet = "This oatmeal is not good. Its mushy, soft, I don't like it. Quaker Oats is the way to go."
print(tweet)
# load model and tokenizer
roberta = "cardiffnlp/twitter-roberta-base-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(roberta)
tokenizer = AutoTokenizer.from_pretrained(roberta)
labels = ['Negative', 'Neutral', 'Positive']
encoded_tweet = tokenizer(tweet, return_tensors='pt')
print(encoded_tweet)
# sentiment analysis
output = model(**encoded_tweet)
scores = output[0][0].detach().numpy()
scores = softmax(scores)
# print each label next to its probability
for i in range(len(scores)):
    l = labels[i]
    s = scores[i]
    print(l, s)
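I took most of this from a guide on how to use the package, but removed the data processing stage.
I have imported the CSV as a dataframe. Can anyone help me use the "cleaned_tweets" column from my dataframe instead of manually entering the text? How would I generate sentiment scores for the cleaned_tweets variable for every row in the dataframe, and then append the negative/neutral/positive scores to each row of the dataframe?
Sorry for the basic question, and thank you very much for any help!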
Use df.cleaned_tweets
or df["cleaned_tweets"]
Either one gives you a pandas Series object.
df[["cleaned_tweets"]]
gives you a dataframe.
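The difference between the single-bracket and double-bracket forms can be checked on a toy frame. This is only a small sketch: the three example rows are made up and stand in for the CSV from the question.

```python
import pandas as pd

# Toy stand-in for the uploaded CSV (made-up rows, one of them missing).
df = pd.DataFrame({"cleaned_tweets": ["good oatmeal", "mushy and soft", None]})

as_series = df["cleaned_tweets"]    # single brackets -> pandas Series
as_frame = df[["cleaned_tweets"]]   # double brackets -> one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

For passing text to a tokenizer, the Series form is usually the more convenient of the two, because it converts directly to a flat list of strings.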
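If your model object exposes a predict method, you can pass the whole pandas column to it for prediction. Note that the Hugging Face model loaded in the question is a PyTorch module and has no predict method, so this line only applies to models with a scikit-learn-style interface: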
df_results = model.predict(df["cleaned_tweets"])
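If you use the tokenizer, the documentation states that it accepts a list of str:
text (str, List[str], List[List[str]]): The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
You only need to convert the pandas column to a list: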
list_of_cleaned_tweets = df['cleaned_tweets'].tolist()
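Once the column is a list, the tokenizer and model can process several tweets in one call. The following is a minimal sketch rather than a definitive implementation: `score_texts` and `softmax_rows` are hypothetical helper names, the `tokenizer` and `model` arguments are assumed to be the objects loaded with `from_pretrained` in the question, and the list is assumed to contain only strings (no missing values).

```python
import numpy as np

def softmax_rows(logits):
    """Row-wise softmax; each row of the result sums to 1."""
    # Subtracting the row maximum keeps np.exp from overflowing.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def score_texts(texts, tokenizer, model):
    """Score a list of strings in one forward pass.

    Returns an array of shape (len(texts), number_of_labels).
    """
    # padding=True pads shorter tweets to the longest one in the batch;
    # truncation=True cuts inputs that exceed the model's maximum length.
    encoded = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    output = model(**encoded)
    logits = output.logits.detach().numpy()
    return softmax_rows(logits)
```

With the objects from the question this would be called as `score_texts(list_of_cleaned_tweets, tokenizer, model)`. For thousands of tweets, calling it on slices of the list rather than on the whole list at once keeps memory use bounded.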
For anyone who finds this in the future, here is the code I used to run the script:
! pip install transformers
! pip install scipy
import pandas as pd
import io
import numpy as np
from google.colab import files
uploaded = files.upload()
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax
# load model and tokenizer
roberta = "cardiffnlp/twitter-roberta-base-sentiment-latest"
model = AutoModelForSequenceClassification.from_pretrained(roberta)
tokenizer = AutoTokenizer.from_pretrained(roberta)
labels = ['Negative', 'Neutral', 'Positive']
df = pd.read_csv('nameofcsv.csv')
# store the row count once so it can be reused below
total_tweets = len(df['cleaned_tweets'])
# add the Negative / Neutral / Positive columns, initially filled with NaN
for label in labels:
    df[label] = [np.nan] * total_tweets
for i, tweet in enumerate(df['cleaned_tweets']):
    # skip rows with a missing tweet (read_csv stores empty cells as NaN)
    if pd.notna(tweet):
        encoded_tweet = tokenizer(tweet, return_tensors='pt')
        # sentiment analysis
        output = model(**encoded_tweet)
        scores = output[0][0].detach().numpy()
        scores = softmax(scores)
        for label, score in zip(labels, scores):
            # df.loc avoids the chained assignment df[label][i] = score;
            # i matches the row label because read_csv gives a 0..n-1 index
            df.loc[i, label] = score
print(df)
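The loop above scores one tweet per model call. A batched variant that writes the scores back by index label could look like the sketch below. It rests on stated assumptions: `add_sentiment_columns` is a hypothetical helper, and `score_batch` is a hypothetical callable that takes a list of strings and returns one row of probabilities per string (in practice it would wrap the tokenizer and model calls). A placeholder scorer is used here so the example runs without downloading a model.

```python
import numpy as np
import pandas as pd

LABELS = ["Negative", "Neutral", "Positive"]

def add_sentiment_columns(df, score_batch, text_col="cleaned_tweets",
                          labels=LABELS, batch_size=32):
    """Return a copy of df with one probability column per label.

    score_batch is a hypothetical callable: list of str -> array of shape
    (len(list), len(labels)). Rows with missing text keep NaN scores.
    """
    out = df.copy()
    for label in labels:
        out[label] = np.nan
    # Only rows that actually contain text are sent to the scorer.
    valid = out[text_col].dropna()
    for start in range(0, len(valid), batch_size):
        chunk = valid.iloc[start:start + batch_size]
        probs = np.asarray(score_batch(chunk.tolist()))
        # chunk.index holds the original row labels, so scores land on
        # the right rows even when some rows were skipped.
        out.loc[chunk.index, labels] = probs
    return out

def fake_scorer(texts):
    # Placeholder scorer: uniform probabilities instead of a real model.
    return np.full((len(texts), 3), 1 / 3)

demo = pd.DataFrame({"cleaned_tweets": ["great", None, "awful"]})
result = add_sentiment_columns(demo, fake_scorer, batch_size=2)
print(result)
```

Working on a copy leaves the original dataframe untouched, and writing through the index means missing tweets simply keep their NaN scores.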