Replace repeated sentences with the word "same"

Question

I would like to replace repeated comments with the word "same", while keeping the first (original) comment and updating the id column as shown below. However, some comments do not match exactly, such as the last three.

import pandas as pd

data = {'Key': ['111', '111', '111', '222*1', '222*2', '333*1', '333*2', '333*3', '444', '444', '444'],
        'id': ['', '', '', '1', '2', '1', '2', '3', '', '', ''],
        'comment': ['wrong sentence', 'wrong sentence', 'wrong sentence', 'M', 'M', 'F', 'F', 'F', 'wrong sentence used in the topic', 'wrong sentence used', 'wrong sentence use']}

# Create DataFrame
df = pd.DataFrame(data)

print(df)

Input:

Desired output:

Answer 1

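# Boolean mask: True where the comment contains 'wrong sentence'
ind = df['comment'].str.contains('wrong sentence')

def my_func(x):
    # x is the sub-DataFrame for one Key; df itself is modified in place
    if len(x['comment'].values[0]) > 1 and len(x) > 1 and ind[x.index[0]]:
        # keep the first comment and mark the remaining rows as 'same'
        df.loc[x.index[1:], 'comment'] = 'same'
        # number the rows of the group 1..n
        df.loc[x.index, 'id'] = range(1, len(x)+1)

df.groupby('Key').apply(my_func)

print(df)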

Output

      Key id                           comment
0     111  1                    wrong sentence
1     111  2                              same
2     111  3                              same
3   222*1  1                                 M
4   222*2  2                                 M
5   333*1  1                                 F
6   333*2  2                                 F
7   333*3  3                                 F
8     444  1  wrong sentence used in the topic
9     444  2                              same
10    444  3                              same

Here, str.contains is used to match 'wrong sentence'. The result is a boolean mask.
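As a small illustration of that mask, here is a minimal sketch on a three-row Series. The sample values and the names comments and mask are made up for this example; only str.contains comes from the answer.

```python
import pandas as pd

# Hypothetical sample values, chosen to mirror the question's comments
comments = pd.Series(['wrong sentence', 'M', 'wrong sentence used'])

# str.contains does a substring match, so longer variants also match
mask = comments.str.contains('wrong sentence')

print(mask.tolist())  # [True, False, True]
```

Because the match is on a substring, 'wrong sentence used' is flagged as well, which is what lets the inexact comments in the 444 group pass the check.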

groupby is applied on the 'Key' column, and each group is passed to the user-defined function my_func. The function checks three conditions: the first comment in the group is longer than one character, the group has more than one row, and the first comment matches 'wrong sentence'.

loc is then used to overwrite the values.
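The same three conditions can also be written without modifying df from inside apply. The following is only a sketch of that idea, not the answer's method. It assumes a shortened copy of the question's data, and the names grouped, first, size, position and eligible are introduced here for illustration.

```python
import pandas as pd

# Shortened copy of the question's data (assumption for this sketch)
df = pd.DataFrame({
    'Key': ['111', '111', '111', '222*1', '222*2', '444', '444'],
    'id': ['', '', '', '1', '2', '', ''],
    'comment': ['wrong sentence', 'wrong sentence', 'wrong sentence', 'M', 'M',
                'wrong sentence used in the topic', 'wrong sentence used'],
})

grouped = df.groupby('Key')
first = grouped['comment'].transform('first')  # first comment of each group
size = grouped['comment'].transform('size')    # number of rows in each group
position = grouped.cumcount() + 1              # 1-based row number within the group

# Same three checks as my_func, evaluated for every row at once
eligible = (first.str.len() > 1) & (size > 1) & first.str.contains('wrong sentence')

df.loc[eligible, 'id'] = position[eligible]
df.loc[eligible & (position > 1), 'comment'] = 'same'

print(df)
```

Computing the masks first and assigning once keeps the grouping step free of side effects, at the cost of a few extra intermediate Series.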

Update

def my_func(x):
    unic = x['comment'].str.slice(start=0, stop=10).value_counts().values[0]
    clv = len(x)
    if len(x['comment'].values[0]) > 1 and clv > 1 and unic == clv:
        df.loc[x.index[1:], 'comment'] = 'same'
        df.loc[x.index, 'id'] = range(1, clv+1)

df.groupby('Key').apply(my_func)

print(df)
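To see what the updated check computes, here is a minimal sketch on the three inexact comments from the 444 group. The names group, prefixes and most_common are hypothetical; the calls are the ones used in the update.

```python
import pandas as pd

# The three comments of the 444 group from the question
group = pd.Series(['wrong sentence used in the topic',
                   'wrong sentence used',
                   'wrong sentence use'])

# Only the first 10 characters are compared: 'wrong sent' for every row
prefixes = group.str.slice(start=0, stop=10)

# value_counts sorts by count in descending order, so the first value
# is the size of the largest set of rows sharing a prefix
most_common = prefixes.value_counts().values[0]

print(prefixes.tolist())
print(most_common == len(group))  # True: every row shares the same prefix
```

The group is treated as repeated only when every row shares the same 10-character prefix, so the cutoff of 10 is a choice tied to this data rather than a general rule.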

Answer 2

Use:

# flag rows whose first 10 characters are duplicated, excluding 'M' and 'F' values
m = df['comment'].str[:10].duplicated(keep=False) & ~df['comment'].isin(['M','F'])
# label each consecutive run of flagged rows, then number the rows within each run
counter = df.groupby((~m).cumsum().where(m)).cumcount().add(1)

# assign the counter only to the flagged rows
df.loc[m, 'id'] = counter[m]

# replace every flagged row after the first in its run (counter > 1) with 'same'
df.loc[counter.gt(1) & m, 'comment'] = 'same'
print(df)
      Key id                           comment
0     111  1                    wrong sentence
1     111  2                              same
2     111  3                              same
3   222*1  1                                 M
4   222*2  2                                 M
5   333*1  1                                 F
6   333*2  2                                 F
7   333*3  3                                 F
8     444  1  wrong sentence used in the topic
9     444  2                              same
10    444  3                              same
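The grouping key (~m).cumsum().where(m) is the least obvious step, so here is a sketch of the intermediate values on a small Series. The sample data and the name block_id are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical sample: two flagged runs separated by 'M' rows
comment = pd.Series(['wrong sentence', 'wrong sentence', 'M', 'M',
                     'wrong sentence used', 'wrong sentence use'])

m = comment.str[:10].duplicated(keep=False) & ~comment.isin(['M', 'F'])
print(m.tolist())  # [True, True, False, False, True, True]

# (~m).cumsum() increases on every unflagged row, so each consecutive
# run of flagged rows shares one value; where(m) blanks the other rows
block_id = (~m).cumsum().where(m)
print(block_id[m].tolist())  # one label per flagged row: 0 for the first run, 2 for the second

# Number the rows inside each run, starting at 1
counter = comment.groupby(block_id).cumcount().add(1)
print(counter[m].tolist())  # [1, 2, 1, 2]
```

Only the flagged rows of counter are used afterwards, which is why the answer always indexes it with m.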

If you also need to test for duplicates within each Key group:

m = df['comment'].str[:10].duplicated(keep=False) & ~df['comment'].isin(['M','F'])
counter = df.groupby(['Key',(~m).cumsum().where(m)]).cumcount().add(1)

df.loc[m, 'id'] = counter[m]
df.loc[counter.gt(1) & m, 'comment'] = 'same'
print (df)
      Key id                           comment
0     111  1                    wrong sentence
1     111  2                              same
2     111  3                              same
3   222*1  1                                 M
4   222*2  2                                 M
5   333*1  1                                 F
6   333*2  2                                 F
7   333*3  3                                 F
8     444  1  wrong sentence used in the topic
9     444  2                              same
10    444  3                              same
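On the question's data both variants give the same result, because the M and F rows separate the two flagged runs. A sketch of a case where they differ, assuming hypothetical data in which two different keys sit next to each other:

```python
import pandas as pd

# Hypothetical data: keys 111 and 444 are adjacent, with no rows in between
df = pd.DataFrame({
    'Key': ['111', '111', '444', '444'],
    'comment': ['wrong sentence', 'wrong sentence',
                'wrong sentence used', 'wrong sentence use'],
})

m = df['comment'].str[:10].duplicated(keep=False) & ~df['comment'].isin(['M', 'F'])
block_id = (~m).cumsum().where(m)

# Runs only: all four rows form a single run
run_counter = df.groupby(block_id).cumcount().add(1)
print(run_counter.tolist())  # [1, 2, 3, 4]

# Runs split by Key: numbering restarts for each key
key_counter = df.groupby(['Key', block_id]).cumcount().add(1)
print(key_counter.tolist())  # [1, 2, 1, 2]
```

With the per-Key version, the first row of key 444 keeps its original comment instead of being replaced with 'same'.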

Answer 1
1 Accepted 2022-12-14 10:18:26

Answer 2
1 2022-12-14 10:36:00
