
Apply zero-shot transformer model to each row and create new column(s) in pandas for the appropriate label (custom function and .apply)


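I currently have a Huggingface transformer pipeline that performs zero-shot classification. I want to apply it to a column of open-ended answers in a survey dataset, running the model row by row and creating a new column called "theme" (the label with the highest probability score) and a second column called "prob" that holds the probability score of the label in the "theme" column.

So far, I have this mock dataset: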

                                             open_text  col_2  col_3
0  The way he threw that 3-pointer was incredible       NaN    NaN
1  On election day, people tend to queue for way ...    NaN    NaN
2  She did not order it because she was already full    NaN    NaN
3  He enjoyed his hot-dog watching the Lakers game      NaN    NaN

I wrote this custom function:

def zeroshotPipeline(text):
    input_ids = text
    candidate_label = ['basketball', 'politics', 'food']
    template = "This example is {}"
    results = classifier(input_ids, 
                         candidate_label,
                        # multi_label = True,
                         hypothesis_template = template)
    score_id = np.argmax(results["scores"])
    final_label = results["labels"][score_id]
    prob = results["scores"][score_id]
    return final_label, prob


df["theme"] = ""
df["prob"] = np.nan

df['theme'] = df["open_text"].apply(zeroshotPipeline)
print(df)

                                           open_text  ...                            theme
0     The way he threw that 3-pointer was incredible  ...  (politics, 0.12852472066879272)
1  On election day, people tend to queue for way ...  ...   (politics, 0.9359661340713501)
2  She did not order it because she was already full  ...       (food, 0.9898027181625366)
3    He enjoyed his hot-dog watching the Lakers game  ...         (food, 0.99793541431427)

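As you can see, when I print df, the "theme" column contains both the label and the probability as a tuple, separated by a comma. I would like each of these in its own column. How can this be done?

In addition, is there a way to set multi_label = True in the function and add columns theme_1, theme_2 (..) for the labels with a probability > 0.60? In the example above, the result for row 4 (with multi_label = True) is: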

{'sequence': 'He enjoyed his hot-dog watching the Lakers game', 'labels': ['food', 'basketball', 'politics'], 'scores': [0.99793541431427, 0.9612331390380859, 0.01709340512752533]}

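I would then like to add two new columns (theme_1 and theme_2) containing the labels "food" and "basketball", followed by two columns (prob_1 and prob_2) holding their corresponding probabilities.

I am relatively new to Python (coming from R), so this fairly "simple" data wrangling is something I struggle to implement in Python.

Thanks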

First, split the column of tuples into a new DataFrame that contains only the unpacked values. Do something like this:

split_df = pd.DataFrame(df['theme'].tolist(), columns=['theme', 'score'])

Then drop the column you want to replace from the original DataFrame, like this:

df = df.drop(columns='theme')

Then combine the two DataFrames, like this:

df = pd.concat([df, split_df], axis=1)
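
The three steps above can be sketched end to end as follows. This is a minimal illustration and not the asker's actual data: the small DataFrame is a stand-in built from two rows of the printed output in the question, so no model is needed to run it. It names the second column "prob" (as the question asks) rather than "score", and it passes `index=df.index` so the rows still line up if the original index is not the default 0..n-1. In the question's own DataFrame, a pre-created empty "prob" column would need to be dropped as well.

```python
import pandas as pd

# Stand-in for the question's df after .apply(zeroshotPipeline):
# the "theme" column holds (label, score) tuples.
df = pd.DataFrame({
    "open_text": [
        "The way he threw that 3-pointer was incredible",
        "She did not order it because she was already full",
    ],
    "theme": [("politics", 0.12852472066879272), ("food", 0.9898027181625366)],
})

# 1. Unpack the tuples into their own DataFrame, keeping the original index.
split_df = pd.DataFrame(df["theme"].tolist(),
                        columns=["theme", "prob"],
                        index=df.index)

# 2. Drop the tuple column; drop() returns a new DataFrame, so reassign.
df = df.drop(columns="theme")

# 3. Put the original columns and the unpacked columns side by side.
df = pd.concat([df, split_df], axis=1)
print(df)
```

Note that `drop` does not modify the DataFrame in place by default, and that a bare label is interpreted as a row label unless `columns=` (or `axis=1`) is given.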

That should do it for you.
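
For the multi_label part of the question, one possible approach is to flatten each pipeline result into theme_i / prob_i entries and build the extra columns from those. The sketch below is an assumption-laden illustration, not a tested recipe for the real model: `labels_above_threshold` is a hypothetical helper name, the 0.60 cutoff comes from the question, and the single result dict is copied from the question so that the code runs without loading a classifier. It also relies on the labels arriving sorted by descending score, as they are in the question's output.

```python
import pandas as pd

def labels_above_threshold(result, threshold=0.60):
    """Hypothetical helper: turn one zero-shot result dict into
    {'theme_1': ..., 'prob_1': ..., 'theme_2': ..., ...} for every
    label whose score is above the threshold."""
    kept = [(label, score)
            for label, score in zip(result["labels"], result["scores"])
            if score > threshold]
    row = {}
    for i, (label, score) in enumerate(kept, start=1):
        row[f"theme_{i}"] = label
        row[f"prob_{i}"] = score
    return row

# Result copied from the question (row 4 with multi_label=True).
result = {
    "sequence": "He enjoyed his hot-dog watching the Lakers game",
    "labels": ["food", "basketball", "politics"],
    "scores": [0.99793541431427, 0.9612331390380859, 0.01709340512752533],
}

df = pd.DataFrame({"open_text": [result["sequence"]]})

# With the real pipeline, this list would hold one result per row, e.g.
# classifier(text, candidate_label, multi_label=True, hypothesis_template=template)
results = [result]

# One dict per row; rows with fewer qualifying labels get NaN in the extra columns.
themes = pd.DataFrame([labels_above_threshold(r) for r in results], index=df.index)
df = pd.concat([df, themes], axis=1)
print(df)
```

Because the number of labels above the threshold can differ per row, the number of theme_i / prob_i columns is determined by the row with the most qualifying labels.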

