Connecting a dataframe to a function in Python

Question

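Sorry, this is a very basic question, but I am completely new to Python (I have only used R before, since that is what I was taught at university, admittedly not to a very high level), so I don't know how to do this.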

I am performing sentiment analysis on tweets, and found a pre-trained sentiment analysis package (RoBERTa) which runs on Python - I have aggregated and cleaned all my data in R, and now have a CSV with a column with the cleaned tweets.

This is the code I am using:

! pip install transformers
! pip install scipy 
import pandas as pd
import io

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax

from google.colab import files
uploaded = files.upload()

df = pd.read_csv(io.BytesIO(uploaded['example_cleaned_tweets.csv']))
print(df)

tweet = "This oatmeal is not good. Its mushy, soft, I don't like it. Quaker Oats is the way to go."
print(tweet)

# load model and tokenizer
roberta = "cardiffnlp/twitter-roberta-base-sentiment"

model = AutoModelForSequenceClassification.from_pretrained(roberta)
tokenizer = AutoTokenizer.from_pretrained(roberta)

labels = ['Negative', 'Neutral', 'Positive']

encoded_tweet = tokenizer(tweet, return_tensors='pt')
print(encoded_tweet)

# sentiment analysis
output = model(**encoded_tweet)

scores = output[0][0].detach().numpy()
scores = softmax(scores)

for i in range(len(scores)):
    
    l = labels[i]
    s = scores[i]
    print(l,s)

I took a lot of this from a guide on how to use the package, but removed the data-processing stage.

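I have imported the CSV as a dataframe. Can anyone help me understand how to use the "cleaned_tweets" column of my dataframe instead of typing in the text manually? How would I generate sentiment scores for the cleaned_tweets variable for every row of the dataframe, and then append the negative/neutral/positive scores generated for each row to the dataframe?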

Sorry for the basic question, any help is much appreciated!

Answer 1

Use df.cleaned_tweets or df["cleaned_tweets"]; either one gives you a pandas Series object.

df[["cleaned_tweets"]] (double brackets) gives you a dataframe instead.
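To make the difference concrete, here is a minimal sketch. The two-row dataframe is a hypothetical stand-in for the CSV from the question; only the column name cleaned_tweets comes from the original post.

```python
import pandas as pd

# Hypothetical stand-in for the CSV from the question
df = pd.DataFrame({"cleaned_tweets": ["good oatmeal", "bad oatmeal"]})

series = df["cleaned_tweets"]    # single brackets: pandas Series
frame = df[["cleaned_tweets"]]   # double brackets: one-column DataFrame

print(type(series).__name__, series.shape)
print(type(frame).__name__, frame.shape)

# A Series iterates over its values, which is what a per-tweet loop needs
for tweet in series:
    print(tweet)
```

For feeding tweets one by one into a tokenizer, the Series form is usually the convenient one, since iterating over it yields the strings directly.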

Answer 2

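If your model object exposes a scikit-learn-style predict method, you can pass the whole pandas column to it for prediction. Note that the PyTorch AutoModelForSequenceClassification used in the question has no predict method; it expects tokenized inputs.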

df_results = model.predict(df["cleaned_tweets"])

If you use the tokenizer, the documentation states that you can pass a list of str:

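text (str, List[str], List[List[str]]): The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).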

You just need to convert the pandas column to a list:

 list_of_cleaned_tweets = df['cleaned_tweets'].tolist()
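As a rough sketch of how that list could be scored in batches and written back to the dataframe: the helper names below (softmax_rows, add_sentiment_columns, make_roberta_logits_fn, fake_logits) are hypothetical and not part of transformers or pandas. The demo uses a fake logits function so it runs without downloading a model; the wrapper for the real model assumes the PyTorch backend.

```python
import numpy as np
import pandas as pd

LABELS = ["Negative", "Neutral", "Positive"]

def softmax_rows(logits):
    """Row-wise softmax for a 2-D array of logits."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def add_sentiment_columns(df, logits_fn, column="cleaned_tweets", batch_size=32):
    """Return a copy of df with one probability column per label.

    logits_fn (hypothetical) takes a list of str and returns an array
    of shape (len(texts), 3). Rows with a missing tweet keep NaN scores.
    """
    out = df.copy()
    for label in LABELS:
        out[label] = np.nan
    mask = out[column].notna()
    texts = out.loc[mask, column].astype(str).tolist()
    chunks = [
        softmax_rows(logits_fn(texts[start:start + batch_size]))
        for start in range(0, len(texts), batch_size)
    ]
    if chunks:
        probs = np.vstack(chunks)
        for j, label in enumerate(LABELS):
            out.loc[mask, label] = probs[:, j]
    return out

def make_roberta_logits_fn(model, tokenizer):
    """Wrap a Hugging Face model and tokenizer as a logits_fn (PyTorch assumed)."""
    import torch  # imported lazily so the demo below runs without it

    def logits_fn(texts):
        encoded = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            output = model(**encoded)
        return output.logits.numpy()

    return logits_fn

def fake_logits(texts):
    # Stand-in for the real model so the sketch needs no download
    return np.array([[2.0, 0.0, 0.0] if "not" in t else [0.0, 0.0, 2.0] for t in texts])

demo = pd.DataFrame({"cleaned_tweets": ["not good", "love it", None]})
result = add_sentiment_columns(demo, fake_logits, batch_size=2)
print(result)
```

With the model and tokenizer from the question already loaded, the call would be add_sentiment_columns(df, make_roberta_logits_fn(model, tokenizer)). Batching with padding=True is a design choice here; scoring one tweet at a time gives the same kind of result, only more slowly.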

Answer 3

For anyone who finds this in the future, here is the code I used to run the script:

! pip install transformers
! pip install scipy 
import pandas as pd
import io
import numpy as np

from google.colab import files
uploaded = files.upload()

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax

# load model and tokenizer
roberta = "cardiffnlp/twitter-roberta-base-sentiment-latest"

model = AutoModelForSequenceClassification.from_pretrained(roberta)
tokenizer = AutoTokenizer.from_pretrained(roberta)

labels = ['Negative', 'Neutral', 'Positive']

df = pd.read_csv('nameofcsv.csv')

# add one column per label, filled with NaN until a score is computed
for label in labels:
    df[label] = np.nan

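for i, tweet in df['cleaned_tweets'].items():
    # skip rows where the tweet is missing (NaN)
    if pd.notna(tweet):
        encoded_tweet = tokenizer(tweet, return_tensors='pt')

        # sentiment analysis
        output = model(**encoded_tweet)

        scores = output[0][0].detach().numpy()
        scores = softmax(scores)

        # write each score into its label column for this row;
        # df.at avoids the chained assignment df[label][i] = score
        for label, score in zip(labels, scores):
            df.at[i, label] = score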

    
print(df)
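Once the three score columns exist, one possible follow-up is to pick the highest-scoring label per row and export the result, for example to continue the analysis in R. This is only a sketch: the small dataframe below is made up to mimic the shape produced above, and the sentiment column name is an assumption.

```python
import numpy as np
import pandas as pd

labels = ["Negative", "Neutral", "Positive"]

# Hypothetical scored dataframe, shaped like the one the loop above produces
df = pd.DataFrame({
    "cleaned_tweets": ["not good", "love it", np.nan],
    "Negative": [0.80, 0.05, np.nan],
    "Neutral": [0.15, 0.15, np.nan],
    "Positive": [0.05, 0.80, np.nan],
})

# Only rows that actually received scores can have a top label;
# the assignment aligns on the index, so unscored rows stay NaN
scored = df.dropna(subset=labels)
df["sentiment"] = scored[labels].idxmax(axis=1)

# With no path, to_csv returns the CSV text; pass a filename to write a file
csv_text = df.to_csv(index=False)
print(csv_text)
```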

Solution 1
0 2022-09-06 15:30:52

Solution 2
0 2022-09-06 15:32:38

Solution 3
0 2022-09-08 09:50:17
