Create new pandas rows from combinations of text values in different rows that share the same value in another column

Question

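I want to create a new pandas dataframe by concatenating text values from rows that share the same value in another column. For example, I have the following dataframe: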

import pandas as pd

example_dct = {
  "text": {
    "0": "this is my text 1",
    "1": "this is my text 2",
    "2": "this is my text 3",
    "3": "this is my text 4",
    "4": "this is my text 5"
  },
  "article_id": {
    "0": "#0001_01_xml",
    "1": "#0001_01_xml",
    "2": "#0001_02_xml",
    "3": "#0001_03_xml",
    "4": "#0001_03_xml"
  }
}

df_example = pd.DataFrame.from_dict(example_dct) 
print(df_example)

                text    article_id
0  this is my text 1  #0001_01_xml
1  this is my text 2  #0001_01_xml
2  this is my text 3  #0001_02_xml
3  this is my text 4  #0001_03_xml
4  this is my text 5  #0001_03_xml

I want to concatenate them like this: text1 + ' *** ' + text2

So in this case rows 0 and 1 should be joined, and so should rows 3 and 4.

The resulting dataframe would therefore be:

                                        text    article_id
0  'this is my text 1 *** this is my text 2'  #0001_01_xml
1  'this is my text 4 *** this is my text 5'  #0001_03_xml

If more than two text values share the same id, for example:

example_dct = {
  "text": {
    "0": "this is my text 1",
    "1": "this is my text 2",
    "2": "this is my text 3",
    "3": "this is my text 4",
    "4": "this is my text 5",
    "5": "this is my text 6",
  },
  "article_id": {
    "0": "#0001_01_xml",
    "1": "#0001_01_xml",
    "2": "#0001_02_xml",
    "3": "#0001_03_xml",
    "4": "#0001_03_xml", 
    "5": "#0001_03_xml",
  }
}

then the output dataframe should contain every pairwise (one-by-one) concatenation of the texts:

                                        text    article_id
0  'this is my text 1 *** this is my text 2'  #0001_01_xml
1  'this is my text 4 *** this is my text 5'  #0001_03_xml
2  'this is my text 4 *** this is my text 6'  #0001_03_xml
3  'this is my text 5 *** this is my text 6'  #0001_03_xml

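I have been trying groupby queries that join all the texts sharing the same column value, i.e. df.groupby('article_id', sort=False)['text'].apply('***'.join), but that creates only one row per group, whereas I want one row per pair as described above.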
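To make the difference concrete, here is a minimal sketch of what that groupby attempt produces on the second example. The dataframe is rebuilt inline (with a default integer index) so the snippet stands alone. It yields one joined string per article_id rather than one row per pair.

```python
import pandas as pd

# Second example from above, rebuilt inline so the snippet is self-contained.
df_example = pd.DataFrame({
    "text": [f"this is my text {i}" for i in range(1, 7)],
    "article_id": ["#0001_01_xml", "#0001_01_xml", "#0001_02_xml",
                   "#0001_03_xml", "#0001_03_xml", "#0001_03_xml"],
})

# Joins ALL texts of a group into a single string,
# so each article_id ends up with exactly one row.
joined = df_example.groupby("article_id", sort=False)["text"].apply("***".join)
print(joined)
```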

Any ideas on how to approach this?

Answer 1

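Use DataFrame.groupby on article_id with a custom lambda function that generates every possible length=2 combination of the strings in the text column. Then use Series.explode to get one row per pair and Series.dropna to discard groups that have no pair, and finally Series.reset_index to turn the result back into a dataframe.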

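from itertools import combinations

# Join every pair of texts in a group with ' *** ' (empty list if the group has a single text)
f = lambda g: [*map(' *** '.join, combinations(g['text'], r=2))]
# explode() gives one row per pair; dropna() removes groups that produced no pairs
df = df_example.groupby('article_id').apply(f).explode().dropna().reset_index(name='text')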

Result:

# example 1
     article_id                                     text
0  #0001_01_xml  this is my text 1 *** this is my text 2
1  #0001_03_xml  this is my text 4 *** this is my text 5

# example 2
     article_id                                     text
0  #0001_01_xml  this is my text 1 *** this is my text 2
1  #0001_03_xml  this is my text 4 *** this is my text 5
2  #0001_03_xml  this is my text 4 *** this is my text 6
3  #0001_03_xml  this is my text 5 *** this is my text 6
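For anyone who wants to run this end to end, below is a minimal self-contained sketch of the same idea applied to the second example. The function name pairwise_join and its parameters key, col and sep are hypothetical and only for illustration. It also selects the text column before calling apply, which is a small variation on the snippet above rather than the original code.

```python
from itertools import combinations

import pandas as pd


def pairwise_join(df, key="article_id", col="text", sep=" *** "):
    """Return one row per unordered pair of `col` values sharing the same `key`.

    Hypothetical helper: the name and parameters are illustrative assumptions.
    """
    pairs = (
        df.groupby(key, sort=False)[col]
        # For each group, build every 2-element combination and join it with `sep`.
        .apply(lambda s: [sep.join(p) for p in combinations(s, 2)])
        .explode()  # one row per joined pair
        .dropna()   # a group with a single text gives an empty list, i.e. NaN
    )
    return pairs.reset_index(name=col)


df_example = pd.DataFrame({
    "text": [f"this is my text {i}" for i in range(1, 7)],
    "article_id": ["#0001_01_xml", "#0001_01_xml", "#0001_02_xml",
                   "#0001_03_xml", "#0001_03_xml", "#0001_03_xml"],
})

result = pairwise_join(df_example)
print(result)
```

Passing sort=False keeps the groups in their order of first appearance. Selecting the text column first means the lambda only ever sees the strings to combine, not the grouping column.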

Solution 1
2 Accepted 2020-08-12 11:42:26
