Create new pandas rows from combinations of text values in rows that share the same value in another column

Question

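I want to create a new pandas dataframe by concatenating text values from rows that share the same value in another column. For example, I have the following dataframe: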

import pandas as pd

example_dct = {
  "text": {
    "0": "this is my text 1",
    "1": "this is my text 2",
    "2": "this is my text 3",
    "3": "this is my text 4",
    "4": "this is my text 5"
  },
  "article_id": {
    "0": "#0001_01_xml",
    "1": "#0001_01_xml",
    "2": "#0001_02_xml",
    "3": "#0001_03_xml",
    "4": "#0001_03_xml"
  }
}

df_example = pd.DataFrame.from_dict(example_dct) 
print(df_example)

         text           article_id
0  this is my text 1  #0001_01_xml
1  this is my text 2  #0001_01_xml
2  this is my text 3  #0001_02_xml
3  this is my text 4  #0001_03_xml
4  this is my text 5  #0001_03_xml

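I want to concatenate them like this: text1 + '***' + text2

So in this case, the rows at index 0 and 1 should be joined, and likewise the rows at index 3 and 4.

The resulting dataframe would therefore be: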

            text                                article_id
0  'this is my text 1 *** this is my text 2'  #0001_01_xml
1  'this is my text 4 *** this is my text 5'  #0001_03_xml

If more than 2 text values share the same id value, for example:

example_dct = {
  "text": {
    "0": "this is my text 1",
    "1": "this is my text 2",
    "2": "this is my text 3",
    "3": "this is my text 4",
    "4": "this is my text 5",
    "5": "this is my text 6",
  },
  "article_id": {
    "0": "#0001_01_xml",
    "1": "#0001_01_xml",
    "2": "#0001_02_xml",
    "3": "#0001_03_xml",
    "4": "#0001_03_xml", 
    "5": "#0001_03_xml",
  }
}

then the output dataframe should contain the pairwise (one-by-one) text concatenations:

            text                                article_id
0  'this is my text 1 *** this is my text 2'  #0001_01_xml
1  'this is my text 4 *** this is my text 5'  #0001_03_xml
2  'this is my text 4 *** this is my text 6'  #0001_03_xml
3  'this is my text 5 *** this is my text 6'  #0001_03_xml

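I have been trying some groupby queries that join all texts sharing the same column value, i.e. df.groupby('article_id', sort=False)['text'].apply('***'.join), but that creates only a single row per group, whereas I want one row per pair, as described above.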
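
For reference, a minimal sketch of that attempt on the second example. The frame is rebuilt here from plain lists (an assumption made for brevity, so the index is a default integer range rather than the string keys of the dict above):

```python
import pandas as pd

df_example = pd.DataFrame({
    "text": [f"this is my text {i}" for i in range(1, 7)],
    "article_id": ["#0001_01_xml", "#0001_01_xml", "#0001_02_xml",
                   "#0001_03_xml", "#0001_03_xml", "#0001_03_xml"],
})

# '***'.join receives all texts of a group at once,
# so each article_id collapses into exactly one string
joined = df_example.groupby("article_id", sort=False)["text"].apply("***".join)
print(joined)
```

This yields one entry per article_id (three in total), not one entry per pair of texts.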

Any ideas on how to approach this?

Answer 1

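Use DataFrame.groupby on article_id with a custom function that generates all possible length-2 combinations of the strings in the text column, then use Series.explode and Series.dropna, and finally Series.reset_index.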

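from itertools import combinations

# For each group, build every 2-element combination of its texts, joined by ' *** '
f = lambda g: [*map(' *** '.join, combinations(g['text'], r=2))]
# A group with a single text gives an empty list: explode() turns it into NaN, dropna() removes it
df = df_example.groupby('article_id').apply(f).explode().dropna().reset_index(name='text')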

Result:

# example 1
     article_id                                     text
0  #0001_01_xml  this is my text 1 *** this is my text 2
1  #0001_03_xml  this is my text 4 *** this is my text 5

# example 2
     article_id                                     text
0  #0001_01_xml  this is my text 1 *** this is my text 2
1  #0001_03_xml  this is my text 4 *** this is my text 5
2  #0001_03_xml  this is my text 4 *** this is my text 6
3  #0001_03_xml  this is my text 5 *** this is my text 6
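
The same idea can be written as a small reusable function. This is only a sketch under stated assumptions: the name pairwise_texts and its parameters key, col and sep are hypothetical, not part of pandas or of the answer above. It selects the text column before calling apply, so the function never touches the grouping column, which recent pandas versions warn about.

```python
from itertools import combinations

import pandas as pd


def pairwise_texts(df, key="article_id", col="text", sep=" *** "):
    """Return one row per 2-combination of `col` values that share the same `key`.

    Hypothetical helper for illustration only.
    """
    pairs = (
        df.groupby(key, sort=False)[col]
          # one list of joined pairs per group; empty list if the group has a single text
          .apply(lambda s: [sep.join(p) for p in combinations(s, 2)])
          .explode()   # one row per pair; empty lists become NaN
          .dropna()    # drop groups that had no pair
    )
    return pairs.reset_index(name=col)


# Second example from the question, rebuilt from plain lists
df_example = pd.DataFrame({
    "text": [f"this is my text {i}" for i in range(1, 7)],
    "article_id": ["#0001_01_xml", "#0001_01_xml", "#0001_02_xml",
                   "#0001_03_xml", "#0001_03_xml", "#0001_03_xml"],
})

result = pairwise_texts(df_example)
print(result)
```

Because combinations preserves the input order, the pairs within each article_id come out in the original row order.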

Solution 1
2 · accepted · 2020-08-12 11:42:26