Apply function to each row of pandas dataframe to create two new columns
Apply zero-shot transformer model to each row and create new column(s) in pandas for the appropriate label (custom function and .apply)
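I currently have this Hugging Face transformers pipeline that performs zero-shot classification. I want to apply it to the open-ended answers column of a survey dataset, running the model row by row and creating a new column called "theme" (the label with the highest probability score) and a second column called "prob" that holds the probability score of the label in the "theme" column.
So far I have this mock dataset: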
open_text col_2 col_3
0 The way he threw that 3-pointer was incredible NaN NaN
1 On election day, people tend to queue for way ... NaN NaN
2 She did not order it because she was already full NaN NaN
3 He enjoyed his hot-dog watching the Lakers game NaN NaN
I wrote this custom function:
def zeroshotPipeline(text):
    input_ids = text
    candidate_label = ['basketball', 'politics', 'food']
    template = "This example is {}"
    results = classifier(input_ids,
                         candidate_label,
                         # multi_label = True,
                         hypothesis_template = template)
    score_id = np.argmax(results["scores"])
    final_label = results["labels"][score_id]
    prob = results["scores"][score_id]
    return final_label, prob

df["theme"] = ""
df["prob"] = np.nan
df['theme'] = df["open_text"].apply(zeroshotPipeline)
print(df)
open_text ... theme
0 The way he threw that 3-pointer was incredible ... (politics, 0.12852472066879272)
1 On election day, people tend to queue for way ... ... (politics, 0.9359661340713501)
2 She did not order it because she was already full ... (food, 0.9898027181625366)
3 He enjoyed his hot-dog watching the Lakers game ... (food, 0.99793541431427)
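As you can see, when I print df, the "theme" column contains both the label and the probability as a single tuple. I want each of these in its own column. How can this be done?
Also, is there a way to set multi_label = True in the function and, based on the labels with a probability > 0.60, add columns theme_1, theme_2 (..)? In the example above, the result for the 4th row (with multi_label = True) is: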
{'sequence': 'He enjoyed his hot-dog watching the Lakers game', 'labels': ['food', 'basketball', 'politics'], 'scores': [0.99793541431427, 0.9612331390380859, 0.01709340512752533]}
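I would then like to include two new columns (theme_1 and theme_2) containing the labels "food" and "basketball", followed by two columns (prob_1 and prob_2) holding their corresponding probabilities.
I am relatively new to Python (coming from R), so this fairly "simple" kind of data wrangling is something I still find hard to implement in Python.
Thanks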
First, split the column of tuples into a new dataframe that contains only the values from that column. Do something like this:
split_df = pd.DataFrame(df['theme'].tolist(), columns=['theme', 'score'])
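As a small illustration of what this step does, here is a sketch on a toy frame. The two rows and their scores are placeholders rather than real model output, and passing index=df.index is an addition (not in the answer) that keeps the rows aligned if df does not use the default 0..n-1 index.

```python
import pandas as pd

# Hypothetical stand-in for the question's df after .apply returned tuples.
df = pd.DataFrame({
    "open_text": ["text a", "text b"],
    "theme": [("politics", 0.94), ("food", 0.99)],  # placeholder values
})

# .tolist() gives a list of (label, score) tuples; each tuple becomes one row.
# index=df.index is an assumption added here so the rows line up with df.
split_df = pd.DataFrame(
    df["theme"].tolist(),
    columns=["theme", "score"],
    index=df.index,
)
print(split_df)
```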
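Then drop the column you are replacing from the original dataframe, like so (note that drop returns a new dataframe, so the result has to be assigned back):
df = df.drop(columns='theme')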
Then combine the two dataframes, like so:
df = pd.concat([df, split_df], axis=1)
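If the intermediate dataframe is not needed, a possible alternative is to unpack the tuples directly into two columns with zip. The sketch below is only an illustration: fake_classifier and its KEYWORD_SCORES table are hypothetical stand-ins with made-up scores so the example runs without downloading a model, and zero_shot_top is an illustrative name for the question's function.

```python
import pandas as pd

CANDIDATE_LABELS = ["basketball", "politics", "food"]

# Made-up placeholder scores so the sketch runs offline; NOT real model output.
KEYWORD_SCORES = {
    "3-pointer": ("basketball", 0.90),
    "Lakers": ("basketball", 0.80),
    "election": ("politics", 0.90),
    "order": ("food", 0.85),
    "hot-dog": ("food", 0.95),
}


def fake_classifier(text, candidate_labels):
    """Hypothetical stand-in returning the same dict shape as the pipeline."""
    scores = {label: 0.05 for label in candidate_labels}
    for keyword, (label, score) in KEYWORD_SCORES.items():
        if keyword in text and label in scores:
            scores[label] = max(scores[label], score)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {"sequence": text, "labels": ranked,
            "scores": [scores[label] for label in ranked]}


def zero_shot_top(text):
    """Return (label, score) for the highest-scoring label."""
    result = fake_classifier(text, CANDIDATE_LABELS)
    best = max(range(len(result["scores"])), key=result["scores"].__getitem__)
    return result["labels"][best], result["scores"][best]


df = pd.DataFrame({"open_text": [
    "The way he threw that 3-pointer was incredible",
    "On election day, people tend to queue for way ...",
    "She did not order it because she was already full",
    "He enjoyed his hot-dog watching the Lakers game",
]})

# .apply yields a Series of tuples; zip(*...) transposes it into two sequences.
themes, probs = zip(*df["open_text"].apply(zero_shot_top))
df["theme"] = list(themes)
df["prob"] = list(probs)
print(df[["theme", "prob"]])
```

With the real pipeline, the body of zero_shot_top would call classifier(...) exactly as in the question; only the unpacking at the end changes.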
That should do it for you.
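The multi_label = True part of the question (one theme_N / prob_N pair per label scoring above 0.60) can be approached in a similar way: return a dict per row and let pandas build the wide columns. This is a rough sketch under assumptions, not a tested recipe against the real model. fake_classifier is a hypothetical stand-in with made-up scores, and zero_shot_multi and the threshold parameter are illustrative names. With the real pipeline you would call classifier(text, candidate_label, multi_label=True, hypothesis_template=template) instead.

```python
import pandas as pd

CANDIDATE_LABELS = ["basketball", "politics", "food"]

# Made-up placeholder scores so the sketch runs offline; NOT real model output.
KEYWORD_SCORES = {
    "3-pointer": ("basketball", 0.90),
    "Lakers": ("basketball", 0.80),
    "election": ("politics", 0.90),
    "order": ("food", 0.85),
    "hot-dog": ("food", 0.95),
}


def fake_classifier(text, candidate_labels):
    """Hypothetical stand-in for the pipeline called with multi_label=True."""
    scores = {label: 0.05 for label in candidate_labels}
    for keyword, (label, score) in KEYWORD_SCORES.items():
        if keyword in text and label in scores:
            scores[label] = max(scores[label], score)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {"sequence": text, "labels": ranked,
            "scores": [scores[label] for label in ranked]}


def zero_shot_multi(text, threshold=0.60):
    """Return {'theme_1': ..., 'prob_1': ..., ...} for labels above threshold."""
    result = fake_classifier(text, CANDIDATE_LABELS)
    kept = [(label, score)
            for label, score in zip(result["labels"], result["scores"])
            if score > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    row = {}
    for i, (label, score) in enumerate(kept, start=1):
        row[f"theme_{i}"] = label
        row[f"prob_{i}"] = score
    return row


df = pd.DataFrame({"open_text": [
    "The way he threw that 3-pointer was incredible",
    "On election day, people tend to queue for way ...",
    "She did not order it because she was already full",
    "He enjoyed his hot-dog watching the Lakers game",
]})

# A list of dicts becomes a wide frame; rows with fewer labels get NaN.
wide = pd.DataFrame(df["open_text"].apply(zero_shot_multi).tolist(),
                    index=df.index)
df = df.join(wide)
print(df.drop(columns="open_text"))
```

Rows where only one label clears the threshold end up with NaN in theme_2 and prob_2, which is usually easier to filter on than an empty string.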