Apply zero-shot transformer model to each row and create new column(s) in pandas for the appropriate label (custom function and .apply)

Question

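I currently have this Huggingface transformers pipeline that does zero-shot classification. I want to apply it to a column of open-ended answers in a survey dataset, running the model row by row and creating a new column called "theme" (the label with the highest probability score) and a second column called "prob" holding the probability score of the label in the "theme" column.

So far I have this mock dataset: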

                                             open_text  col_2  col_3
0  The way he threw that 3-pointer was incredible       NaN    NaN
1  On election day, people tend to queue for way ...    NaN    NaN
2  She did not order it because she was already full    NaN    NaN
3  He enjoyed his hot-dog watching the Lakers game      NaN    NaN

I wrote this custom function:

def zeroshotPipeline(text):
    input_ids = text
    candidate_label = ['basketball', 'politics', 'food']
    template = "This example is {}"
    results = classifier(input_ids, 
                         candidate_label,
                        # multi_label = True,
                         hypothesis_template = template)
    score_id = np.argmax(results["scores"])
    final_label = results["labels"][score_id]
    prob = results["scores"][score_id]
    return final_label, prob


df["theme"] = ""
df["prob"] = np.nan

df['theme'] = df["open_text"].apply(zeroshotPipeline)
print(df)

                                           open_text  ...                            theme
0     The way he threw that 3-pointer was incredible  ...  (politics, 0.12852472066879272)
1  On election day, people tend to queue for way ...  ...   (politics, 0.9359661340713501)
2  She did not order it because she was already full  ...       (food, 0.9898027181625366)
3    He enjoyed his hot-dog watching the Lakers game  ...         (food, 0.99793541431427)

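As you can see, when I print df, the "theme" column holds the label and the probability together as a tuple, separated by a comma. I would like each of these in its own column. How can this be done?

Also, is there a way to add multi_label = True to the function and, based on the labels with a probability > 0.60, add columns theme_1, theme_2 (..)? So in the example above, the result for the fourth row (with multi_label = True) is: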

{'sequence': 'He enjoyed his hot-dog watching the Lakers game', 'labels': ['food', 'basketball', 'politics'], 'scores': [0.99793541431427, 0.9612331390380859, 0.01709340512752533]}

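I would then like to include two new columns (theme_1 and theme_2) holding the labels "food" and "basketball", plus two more columns (prob_1 and prob_2) holding their corresponding probabilities.

I am relatively new to python (coming from R), so this kind of fairly "simple" wrangling is exactly what I struggle to implement in python.

Thanks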

Answer 1

First split the column of tuples into a new dataframe that contains only the values from that column. Do something like this:

split_df = pd.DataFrame(df['theme'].tolist(), columns=['theme', 'score'])

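Then drop the column you are replacing from the original dataframe. Note that drop needs to be told it is a column (otherwise it looks for a row labelled 'theme'), and that it returns a new dataframe, so assign the result back:

df = df.drop(columns='theme')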

Then combine the two dataframes like so:

df = pd.concat([df, split_df], axis=1)
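
The three steps can be tried end to end on a small frame. What follows is a minimal sketch, assuming the tuples are already sitting in the "theme" column. The two-row mock data is copied from the output in the question, the score column is named "prob" (the name the question asks for) instead of "score", and the placeholder "theme"/"prob" columns from the question are not created beforehand, since the concat supplies them.

```python
import pandas as pd

# Mock data in the shape shown in the question:
# the 'theme' column holds (label, probability) tuples.
df = pd.DataFrame({
    "open_text": [
        "On election day, people tend to queue for way ...",
        "She did not order it because she was already full",
    ],
    "theme": [
        ("politics", 0.9359661340713501),
        ("food", 0.9898027181625366),
    ],
})

# Step 1: expand the tuples into their own two-column dataframe.
# Reusing df.index keeps the rows aligned if df has a non-default index.
split_df = pd.DataFrame(
    df["theme"].tolist(), columns=["theme", "prob"], index=df.index
)

# Step 2: remove the tuple column (drop returns a new dataframe).
df = df.drop(columns="theme")

# Step 3: place the two dataframes side by side.
df = pd.concat([df, split_df], axis=1)
print(df)
```

If the original dataframe was filtered or re-sorted before this point, the index=df.index argument is what stops pd.concat from misaligning rows.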

That should do it for you.
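
The second part of the question (multi_label = True with a 0.60 cut-off) could be approached along the same lines. This is only a sketch under stated assumptions: fake_classifier is a hypothetical stand-in that returns canned scores in the same dict shape as the output quoted in the question, so the example runs without downloading a model; the scores for the second sentence are made up for illustration; themes_above_threshold and THRESHOLD are names invented here. With the real pipeline you would pass the classifier from the question instead of the stand-in.

```python
import pandas as pd

THRESHOLD = 0.60  # cut-off taken from the question
CANDIDATE_LABELS = ["basketball", "politics", "food"]

# Canned scores. The first entry is the multi-label result quoted in the
# question; the second entry is MADE UP purely for illustration.
CANNED = {
    "He enjoyed his hot-dog watching the Lakers game": {
        "food": 0.99793541431427,
        "basketball": 0.9612331390380859,
        "politics": 0.01709340512752533,
    },
    "She did not order it because she was already full": {
        "food": 0.98, "basketball": 0.02, "politics": 0.01,
    },
}

def fake_classifier(text, candidate_labels, multi_label=True):
    """Hypothetical stand-in for the zero-shot pipeline (not a real model).

    Returns a dict with 'sequence', 'labels' and 'scores', sorted by
    descending score, mirroring the output shown in the question.
    """
    scores = CANNED[text]
    ranked = sorted(candidate_labels, key=lambda lab: scores[lab], reverse=True)
    return {
        "sequence": text,
        "labels": ranked,
        "scores": [scores[lab] for lab in ranked],
    }

def themes_above_threshold(text, classifier, threshold=THRESHOLD):
    """Build {'theme_1': ..., 'prob_1': ..., ...} for labels above threshold."""
    result = classifier(text, CANDIDATE_LABELS, multi_label=True)
    row = {}
    kept = [
        (label, score)
        for label, score in zip(result["labels"], result["scores"])
        if score > threshold
    ]
    for i, (label, score) in enumerate(kept, start=1):
        row[f"theme_{i}"] = label
        row[f"prob_{i}"] = score
    return row

df = pd.DataFrame({"open_text": list(CANNED)})

# One dict per row; rows with fewer qualifying labels get NaN in the
# columns they do not fill.
rows = [themes_above_threshold(t, fake_classifier) for t in df["open_text"]]
wide = pd.DataFrame(rows, index=df.index)

df = df.join(wide)
print(df)
```

Because the number of labels above the threshold varies per row, the number of theme_N/prob_N columns is set by the row with the most qualifying labels, and shorter rows are padded with NaN.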
