
Replace the duplicated sentences with the word "same"

I would like to replace the repeated comments with the word "same", keep the first occurrence of each, and fill in the ID as shown below. However, some comments do not match exactly, such as the last three.

import pandas as pd

# Sample data: the last three comments share a prefix but are not identical
data = {'Key': ['111', '111', '111', '222*1', '222*2', '333*1', '333*2', '333*3', '444', '444', '444'],
        'id': ['', '', '', '1', '2', '1', '2', '3', '', '', ''],
        'comment': ['wrong sentence', 'wrong sentence', 'wrong sentence', 'M', 'M', 'F', 'F', 'F',
                    'wrong sentence used in the topic', 'wrong sentence used', 'wrong sentence use']}

# Create DataFrame
df = pd.DataFrame(data)

print(df)

Input:

[image not included]

Desired output:

[image not included]

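# Boolean mask: True where the comment contains 'wrong sentence'
ind = df['comment'].str.contains('wrong sentence')

def my_func(x):
    # Act only on groups with several rows whose first comment is longer
    # than one character and contains 'wrong sentence'
    if len(x['comment'].values[0]) > 1 and len(x) > 1 and ind[x.index[0]]:
        df.loc[x.index[1:], 'comment'] = 'same'      # keep the first comment
        df.loc[x.index, 'id'] = range(1, len(x)+1)   # number the rows from 1

# my_func modifies df in place; the return value of apply is not used
df.groupby('Key').apply(my_func)

print(df)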

Output:

      Key id                           comment
0     111  1                    wrong sentence
1     111  2                              same
2     111  3                              same
3   222*1  1                                 M
4   222*2  2                                 M
5   333*1  1                                 F
6   333*2  2                                 F
7   333*3  3                                 F
8     444  1  wrong sentence used in the topic
9     444  2                              same
10    444  3                              same

Here, str.contains is used to match 'wrong sentence'. The result is a boolean mask.

groupby is applied on the 'Key' column, and each group is passed to the user-defined function my_func. It checks three conditions: the first comment in the group is longer than one character, the group has more than one row, and the first comment contains 'wrong sentence'.

loc is used to overwrite the values.
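To make those conditions concrete, here is a minimal sketch that prints the mask and the group sizes for the sample comments. It only inspects the data and changes nothing; regex=False is my addition so the text is matched literally.

```python
import pandas as pd

df = pd.DataFrame({
    'Key': ['111', '111', '111', '222*1', '222*2', '333*1', '333*2', '333*3', '444', '444', '444'],
    'comment': ['wrong sentence', 'wrong sentence', 'wrong sentence', 'M', 'M', 'F', 'F', 'F',
                'wrong sentence used in the topic', 'wrong sentence used', 'wrong sentence use'],
})

# True where the comment contains the literal text 'wrong sentence'
ind = df['comment'].str.contains('wrong sentence', regex=False)
print(ind.tolist())
# [True, True, True, False, False, False, False, False, True, True, True]

# Number of rows per Key; only '111' and '444' have more than one row
sizes = df.groupby('Key').size()
print(sizes.to_dict())
```

The 'M' and 'F' rows are skipped for two reasons: each of their keys forms a group of one row, and the mask is False for them.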

Update

def my_func(x):
    # Count of the most frequent 10-character prefix in the group
    unic = x['comment'].str.slice(start=0, stop=10).value_counts().values[0]
    clv = len(x)
    # Proceed only if every row in the group shares that prefix
    if len(x['comment'].values[0]) > 1 and clv > 1 and unic == clv:
        df.loc[x.index[1:], 'comment'] = 'same'
        df.loc[x.index, 'id'] = range(1, clv+1)

df.groupby('Key').apply(my_func)

print(df)
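The prefix test in the updated function can be looked at in isolation. This is a small sketch on the three comments of key 444 only; the cutoff of 10 characters comes from the code above and is a heuristic, not a general similarity measure.

```python
import pandas as pd

# The three near-identical comments from key '444'
group = pd.Series(['wrong sentence used in the topic',
                   'wrong sentence used',
                   'wrong sentence use'])

# First 10 characters of each comment
prefixes = group.str.slice(start=0, stop=10)
print(prefixes.tolist())  # ['wrong sent', 'wrong sent', 'wrong sent']

# Count of the most frequent prefix; equal to the group size
# when every row shares the same prefix
top_count = prefixes.value_counts().values[0]
print(top_count == len(group))  # True
```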

Use:

# flag rows whose first 10 characters are duplicated, excluding the `M`, `F` values
m = df['comment'].str[:10].duplicated(keep=False) & ~df['comment'].isin(['M','F'])
# build consecutive groups for the matched rows only, then count within each group
counter = df.groupby((~m).cumsum().where(m)).cumcount().add(1)

# assign the counter only to matched rows
# (cast to int: the counter can come back as float when unmatched rows have no group)
df.loc[m, 'id'] = counter[m].astype(int)

# set 'same' for duplicates, i.e. matched rows whose counter is greater than 1
df.loc[counter.gt(1) & m, 'comment'] = 'same'
print(df)
      Key id                           comment
0     111  1                    wrong sentence
1     111  2                              same
2     111  3                              same
3   222*1  1                                 M
4   222*2  2                                 M
5   333*1  1                                 F
6   333*2  2                                 F
7   333*3  3                                 F
8     444  1  wrong sentence used in the topic
9     444  2                              same
10    444  3                              same
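To see how the consecutive groups come about, the intermediate values can be printed step by step. This is a sketch on the comment column alone; the name group_id is mine, introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'comment': ['wrong sentence', 'wrong sentence', 'wrong sentence', 'M', 'M', 'F', 'F', 'F',
                'wrong sentence used in the topic', 'wrong sentence used', 'wrong sentence use'],
})

# Matched rows: duplicated 10-character prefix, and not 'M' or 'F'
m = df['comment'].str[:10].duplicated(keep=False) & ~df['comment'].isin(['M', 'F'])
print(m.tolist())
# [True, True, True, False, False, False, False, False, True, True, True]

# (~m).cumsum() grows by one on every unmatched row, so each unbroken run of
# matched rows shares one value; where(m) blanks out the unmatched rows
group_id = (~m).cumsum().where(m)
print(group_id[m].astype(int).tolist())  # [0, 0, 0, 5, 5, 5]

# Running count within each run, starting at 1
counter = df.groupby(group_id).cumcount().add(1)
print(counter[m].astype(int).tolist())   # [1, 2, 3, 1, 2, 3]
```

Because the runs are defined by position, two matched blocks are kept apart only when at least one unmatched row sits between them.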

If duplicates also need to be tested per Key group:

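m = df['comment'].str[:10].duplicated(keep=False) & ~df['comment'].isin(['M','F'])
# group by Key as well, so the counter restarts for every key
counter = df.groupby(['Key',(~m).cumsum().where(m)]).cumcount().add(1)

# cast to int: the counter can come back as float when unmatched rows have no group
df.loc[m, 'id'] = counter[m].astype(int)
df.loc[counter.gt(1) & m, 'comment'] = 'same'
print(df)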
      Key id                           comment
0     111  1                    wrong sentence
1     111  2                              same
2     111  3                              same
3   222*1  1                                 M
4   222*2  2                                 M
5   333*1  1                                 F
6   333*2  2                                 F
7   333*3  3                                 F
8     444  1  wrong sentence used in the topic
9     444  2                              same
10    444  3                              same
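On the sample data both variants give the same result, because the 'M' and 'F' rows separate keys 111 and 444. The difference shows up when matched rows from different keys are adjacent. The data below is made up for illustration and is not from the question.

```python
import pandas as pd

# Made-up data: two different keys whose comments share the same 10-character prefix
df = pd.DataFrame({
    'Key': ['a', 'a', 'b', 'b'],
    'comment': ['wrong sentence', 'wrong sentence used',
                'wrong sentence', 'wrong sentence use'],
})

m = df['comment'].str[:10].duplicated(keep=False) & ~df['comment'].isin(['M', 'F'])
group_id = (~m).cumsum().where(m)

# Position-only grouping: the four rows form a single run
without_key = df.groupby(group_id).cumcount().add(1)
print(without_key.astype(int).tolist())  # [1, 2, 3, 4]

# Grouping by Key as well: the counter restarts at each key
with_key = df.groupby(['Key', group_id]).cumcount().add(1)
print(with_key.astype(int).tolist())     # [1, 2, 1, 2]
```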

