
Connect a dataframe into a function in Python

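Sorry, this is a very basic question, but I'm completely new to Python (I've only used R before, as that's what I was taught at university, and admittedly not to a very high level), so I don't know how to do this.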

I am performing sentiment analysis on tweets, and found a pre-trained sentiment analysis package (RoBERTa) which runs on Python - I have aggregated and cleaned all my data in R, and now have a CSV with a column with the cleaned tweets.

This is the code I am using:

! pip install transformers
! pip install scipy 
import pandas as pd
import io

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax

from google.colab import files
uploaded = files.upload()

df = pd.read_csv(io.BytesIO(uploaded['example_cleaned_tweets.csv']))
print(df)

tweet = "This oatmeal is not good. Its mushy, soft, I don't like it. Quaker Oats is the way to go."
print(tweet)

# load model and tokenizer
roberta = "cardiffnlp/twitter-roberta-base-sentiment"

model = AutoModelForSequenceClassification.from_pretrained(roberta)
tokenizer = AutoTokenizer.from_pretrained(roberta)

labels = ['Negative', 'Neutral', 'Positive']

encoded_tweet = tokenizer(tweet, return_tensors='pt')
print(encoded_tweet)

# sentiment analysis
output = model(**encoded_tweet)

scores = output[0][0].detach().numpy()
scores = softmax(scores)

for i in range(len(scores)):
    
    l = labels[i]
    s = scores[i]
    print(l,s)

I've taken most of this from a guide on how to use the package, but removed the data processing stage.

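I have imported the CSV as a dataframe. Can anyone help me understand how to use the "cleaned_tweets" column of my dataframe instead of typing in the text manually? How would I generate sentiment scores for the cleaned_tweets variable in every row of the dataframe, and then append the negative/neutral/positive scores to each row?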

Sorry for the basic question, any help is much appreciated!

Use df.cleaned_tweets or df["cleaned_tweets"]; either one gives you a pandas Series object.

df[["cleaned_tweets"]] will give you a dataframe.
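To make the difference between the two forms concrete, here is a minimal sketch. The two-row dataframe is a hypothetical stand-in for the uploaded CSV; only the column name comes from the question.

```python
import pandas as pd

# Hypothetical stand-in for the uploaded CSV
df = pd.DataFrame({"cleaned_tweets": ["good oatmeal", "mushy oatmeal"]})

as_series = df["cleaned_tweets"]    # single brackets select one column as a Series
as_frame = df[["cleaned_tweets"]]   # double brackets select a list of columns as a DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The Series is one-dimensional, so it is usually the form you want when feeding text to a function one value at a time.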

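If your model exposes a scikit-learn-style predict method, you can pass the whole pandas column for prediction. Note that the Hugging Face model loaded in the question is a PyTorch module and has no predict method, so this line will not work with it as written: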

df_results = model.predict(df["cleaned_tweets"])

If you are using the tokenizer, the documentation states that you can pass a list of str:

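text (str, List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).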

You only need to convert the pandas column to a list:

 list_of_cleaned_tweets = df['cleaned_tweets'].tolist()
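As a rough sketch of what passing that list to the tokenizer could look like, the helper below scores a whole batch in one call. The function name score_batch is hypothetical, and it assumes the tokenizer and model are the Hugging Face objects loaded in the question and that the list contains only strings (no NaN).

```python
import numpy as np
from scipy.special import softmax

def score_batch(texts, tokenizer, model):
    """Return an array of shape (len(texts), n_labels) with softmax probabilities."""
    # padding=True pads every text to the longest one in the batch;
    # truncation=True asks the tokenizer to cut texts at its configured maximum length.
    encoded = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
    output = model(**encoded)
    logits = output.logits.detach().numpy()
    # axis=1 normalises each row (one tweet) separately
    return softmax(np.asarray(logits), axis=1)

# Usage with the tokenizer and model loaded earlier (this downloads the model):
# list_of_cleaned_tweets = df['cleaned_tweets'].dropna().tolist()
# probabilities = score_batch(list_of_cleaned_tweets, tokenizer, model)
```

For a large CSV it is probably safer to call this on slices of a few dozen tweets at a time, since one huge batch can exhaust memory.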

For anyone who finds this in the future, here is the code I used to run the script:

! pip install transformers
! pip install scipy 
import pandas as pd
import io
import numpy as np

from google.colab import files
uploaded = files.upload()

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax

# load model and tokenizer
roberta = "cardiffnlp/twitter-roberta-base-sentiment-latest"

model = AutoModelForSequenceClassification.from_pretrained(roberta)
tokenizer = AutoTokenizer.from_pretrained(roberta)

labels = ['Negative', 'Neutral', 'Positive']

df = pd.read_csv('nameofcsv.csv')

# adds the columns for negative, neutral, positive, filled with NaN until scored
for label in labels:
    df[label] = np.nan

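for i, tweet in enumerate(df['cleaned_tweets']):
    # skip missing values (empty cells are read in as NaN)
    if pd.notna(tweet):
        encoded_tweet = tokenizer(tweet, return_tensors='pt')

        # sentiment analysis
        output = model(**encoded_tweet)

        scores = output[0][0].detach().numpy()
        scores = softmax(scores)

        # write each score into its column; .loc avoids chained assignment
        # (assumes the default 0..n-1 index that read_csv produces)
        for label, score in zip(labels, scores):
            df.loc[i, label] = score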

    
print(df)
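The same idea can also be written without a row-by-row loop. The sketch below is one possible arrangement, not the code from the answer: add_scores and toy_score_fn are hypothetical names, and the toy scorer returns fixed numbers purely to show how the score columns line up with the rows. A real score_fn would wrap the tokenizer and model.

```python
import numpy as np
import pandas as pd

labels = ['Negative', 'Neutral', 'Positive']

def add_scores(df, score_fn, text_col='cleaned_tweets'):
    """Return a copy of df with one score column per label.

    score_fn takes a list of str and returns one row of scores per text.
    """
    has_text = df[text_col].notna()
    texts = df.loc[has_text, text_col].tolist()
    # reshape keeps the (n, 3) layout even when there are no texts at all
    values = np.asarray(score_fn(texts), dtype=float).reshape(-1, len(labels))
    scores = pd.DataFrame(values, index=df.index[has_text], columns=labels)
    # join aligns on the index, so rows without text get NaN in the score columns
    return df.join(scores)

def toy_score_fn(texts):
    # placeholder scorer: the same made-up scores for every text
    return [[0.2, 0.3, 0.5] for _ in texts]

demo = pd.DataFrame({'cleaned_tweets': ['great day', np.nan, 'awful service']})
result = add_scores(demo, toy_score_fn)
print(result)
```

Because the scores are joined on the index, this also works if rows were filtered earlier and the index is no longer 0..n-1. The result can then be written out with result.to_csv(...) for further work in R.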

