Create a union of two columns in pandas

Question

I have two DataFrames with identical columns. However, the 'labels' column can hold different labels in each one. All labels are comma-separated strings. I want to take the union of the labels in order to go from this:

df1:

    id1   id2 labels language
0   1     1   1      en
1   2     3          en
2   3     4   4      en
3   4     5          en
4   5     6          en

df2:

    id1   id2 labels language
0   1     1   1,2    en
1   2     3          en
2   3     4   5,7    en
3   4     5          en
4   5     6   3      en

to this:

    id1   id2 labels language
0   1     1   1,2    en
1   2     3          en
2   3     4   4,5,7  en
3   4     5          en
4   5     6   3      en
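For anyone who wants to reproduce this, the two frames can be built roughly as follows. This is only a sketch: it assumes the blank cells in the tables are empty strings rather than NaN, which the tables alone do not confirm.

```python
import pandas as pd

# Assumption: blank 'labels' cells are empty strings, not NaN.
df1 = pd.DataFrame({
    "id1": [1, 2, 3, 4, 5],
    "id2": [1, 3, 4, 5, 6],
    "labels": ["1", "", "4", "", ""],
    "language": ["en"] * 5,
})
df2 = pd.DataFrame({
    "id1": [1, 2, 3, 4, 5],
    "id2": [1, 3, 4, 5, 6],
    "labels": ["1,2", "", "5,7", "", "3"],
    "language": ["en"] * 5,
})

print(df1)
print(df2)
```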

I've tried this:

df1['labels'] = df1['labels'].apply(lambda x: set(str(x).split(',')))
df2['labels'] = df2['labels'].apply(lambda x: set(str(x).split(',')))
result = df1.merge(df2, on=['id1', 'id2', 'language'], how='outer')

result['labels'] = result[['labels_x', 'labels_y']].apply(lambda x: list(set.union(*x)) if None not in x else set(), axis=1)
result['labels'] = result['labels'].apply(lambda x: ','.join(set(x)))
result = result.drop(['labels_x', 'labels_y'], axis=1)

but I get a weird DataFrame with stray commas in some places, e.g. the ,3 in the last row:

    id1   id2 labels language
0   1     1   1,2    en
1   2     3          en
2   3     4   4,5,7  en
3   4     5          en
4   5     6   ,3     en

How can I properly fix the commas? Any help is appreciated!

Answer 1

Here is a possible solution with pandas.merge:

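out = (
    df1.merge(df2, on=["id1", "id2", "language"])
       # build the union from the two suffixed columns, labels_x and labels_y
       .assign(labels=lambda x: x.filter(like="label")
                                 .stack().dropna()    # long Series indexed by (row, column), missing values skipped
                                 .str.split(",")
                                 .explode()           # one label per entry
                                 .groupby(level=0)    # regroup by original row
                                 # de-duplicate within each row only, keeping first-seen order
                                 .agg(lambda s: ",".join(dict.fromkeys(s))))
       .drop(columns=["labels_x", "labels_y"])
       [df1.columns]                                  # restore the original column order
)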

Output:

print(out)

  id1 id2 labels language
0   1   1    1,2       en
1   2   3    NaN       en
2   3   4  4,5,7       en
3   4   5    NaN       en
4   5   6      3       en
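A note on where the stray comma in the question comes from: splitting an empty string gives a list containing one empty string, so the "empty" set still has an element, and joining it with a real label yields ,3 (or 3,). The output above shows NaN for the blank rows, so this answer appears to assume missing values rather than empty strings. Below is a minimal sketch that tolerates both. The helper name union_labels is made up for illustration, and the sample frames mix empty strings and None on purpose.

```python
import pandas as pd

# Splitting "" does not give an empty list, which is the source of the stray comma.
assert "".split(",") == [""]


def union_labels(left, right):
    """Union two comma-separated label strings (hypothetical helper)."""
    parts = []
    for value in (left, right):
        if isinstance(value, str):  # skips NaN / None
            parts.extend(value.split(","))
    # Strip whitespace, drop empty pieces, de-duplicate in first-seen order.
    unique = dict.fromkeys(p.strip() for p in parts if p.strip())
    return ",".join(unique)


# Assumption: blank cells may be either empty strings or missing values.
df1 = pd.DataFrame({
    "id1": [1, 2, 3, 4, 5],
    "id2": [1, 3, 4, 5, 6],
    "labels": ["1", "", "4", "", ""],
    "language": ["en"] * 5,
})
df2 = pd.DataFrame({
    "id1": [1, 2, 3, 4, 5],
    "id2": [1, 3, 4, 5, 6],
    "labels": ["1,2", None, "5,7", None, "3"],
    "language": ["en"] * 5,
})

merged = df1.merge(df2, on=["id1", "id2", "language"], how="outer")
merged["labels"] = [
    union_labels(a, b) for a, b in zip(merged["labels_x"], merged["labels_y"])
]
result = merged.drop(columns=["labels_x", "labels_y"])[df1.columns]
print(result)
```

With this sample data the labels column comes out as 1,2 / (empty) / 4,5,7 / (empty) / 3. Labels keep the order in which they are first seen; wrap the pieces in sorted(...) if a sorted union is preferred.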

Answer 1 posted: 2023-01-31 01:06:17
