Replace the duplicated sentences with the word "same"
I would like to replace the repeated comments with the word "same", but keep the original ones and change the ID as shown below. However, some comments do not match exactly, such as the last three.
import pandas as pd

data = {'Key': ['111', '111', '111', '222*1', '222*2', '333*1', '333*2', '333*3', '444', '444', '444'],
        'id': ['', '', '', '1', '2', '1', '2', '3', '', '', ''],
        'comment': ['wrong sentence', 'wrong sentence', 'wrong sentence', 'M', 'M', 'F', 'F', 'F',
                    'wrong sentence used in the topic', 'wrong sentence used', 'wrong sentence use']}
# Create DataFrame
df = pd.DataFrame(data)
print(df)
Input:
Desired output:
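# Boolean mask: True where the comment contains 'wrong sentence'
ind = df['comment'].str.contains('wrong sentence')

def my_func(x):
    # x holds the rows of one Key; act only on multi-row groups whose first comment matches
    if len(x['comment'].values[0]) > 1 and len(x) > 1 and ind[x.index[0]]:
        df.loc[x.index[1:], 'comment'] = 'same'
        df.loc[x.index, 'id'] = range(1, len(x) + 1)

df.groupby('Key').apply(my_func)
print(df)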
Output:
      Key  id  comment
0     111   1  wrong sentence
1     111   2  same
2     111   3  same
3   222*1   1  M
4   222*2   2  M
5   333*1   1  F
6   333*2   2  F
7   333*3   3  F
8     444   1  wrong sentence used in the topic
9     444   2  same
10    444   3  same
Here, str.contains is used to match 'wrong sentence'. The result is a boolean mask.
groupby is applied on the 'Key' column, and each group is passed to the user-defined function my_func. The function checks three conditions: the first comment in the group is longer than one character, the group has more than one row, and the first comment matches 'wrong sentence'.
loc is then used to overwrite the values: every comment after the first becomes 'same', and the ids are renumbered from 1.
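The same logic can also be written without calling apply purely for its side effects. The following is a minimal sketch rather than the answerer's code: it assumes the sample data from the question, loops over the row labels of each Key group (taken from groupby(...).groups), and applies the same three conditions. The cast of 'id' to object dtype is an added precaution so that integers can be stored next to the existing strings.

```python
import pandas as pd

df = pd.DataFrame({
    'Key': ['111', '111', '111', '222*1', '222*2', '333*1', '333*2', '333*3', '444', '444', '444'],
    'id': ['', '', '', '1', '2', '1', '2', '3', '', '', ''],
    'comment': ['wrong sentence', 'wrong sentence', 'wrong sentence', 'M', 'M', 'F', 'F', 'F',
                'wrong sentence used in the topic', 'wrong sentence used', 'wrong sentence use'],
})
# Assumption: keep 'id' as a generic object column so it can mix strings and integers.
df['id'] = df['id'].astype(object)

# Boolean mask: True where the comment contains 'wrong sentence'
ind = df['comment'].str.contains('wrong sentence')

# .groups maps each Key to the index labels of its rows
for key, labels in df.groupby('Key').groups.items():
    first = labels[0]
    if len(df.at[first, 'comment']) > 1 and len(labels) > 1 and ind[first]:
        # keep the first comment, replace the rest, renumber the ids from 1
        df.loc[labels[1:], 'comment'] = 'same'
        df.loc[labels, 'id'] = list(range(1, len(labels) + 1))

print(df)
```

Collecting the labels first and then writing with loc keeps the grouping step separate from the mutation, which can be easier to reason about than modifying df from inside the applied function.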
Update:
def my_func(x):
    # count of the most common 10-character prefix among the group's comments
    unic = x['comment'].str.slice(start=0, stop=10).value_counts().values[0]
    clv = len(x)
    # proceed only if every comment in the group shares that prefix
    if len(x['comment'].values[0]) > 1 and clv > 1 and unic == clv:
        df.loc[x.index[1:], 'comment'] = 'same'
        df.loc[x.index, 'id'] = range(1, clv + 1)

df.groupby('Key').apply(my_func)
print(df)
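To see what the prefix test in the updated function does, here is a small sketch on standalone Series. The helper name shares_prefix and the sample values are illustrative assumptions, not part of the answer; the pandas calls are the same ones used above.

```python
import pandas as pd

def shares_prefix(comments: pd.Series, n: int = 10) -> bool:
    """Hypothetical helper: True if all comments share their first n characters."""
    prefixes = comments.str.slice(start=0, stop=n)
    # value_counts sorts by frequency, so the first value is the largest count
    top_count = prefixes.value_counts().values[0]
    return int(top_count) == len(comments)

near_duplicates = pd.Series(['wrong sentence used in the topic',
                             'wrong sentence used',
                             'wrong sentence use'])
different = pd.Series(['wrong sentence', 'another remark'])

print(shares_prefix(near_duplicates))
print(shares_prefix(different))
```

All three near-duplicate comments start with 'wrong sent', so the most common prefix covers the whole group; in the second Series each prefix occurs only once.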
Use:
# test the first 10 characters for duplicates, excluding the `M`, `F` values
m = df['comment'].str[:10].duplicated(keep=False) & ~df['comment'].isin(['M','F'])
# create consecutive groups only for rows matched by the mask, then a counter within each group
counter = df.groupby((~m).cumsum().where(m)).cumcount().add(1)
# assign the counter only to matched rows
df.loc[m, 'id'] = counter[m]
# assign 'same' to duplicates, i.e. matched rows whose counter is greater than 1
df.loc[counter.gt(1) & m, 'comment'] = 'same'
print(df)
      Key  id  comment
0     111   1  wrong sentence
1     111   2  same
2     111   3  same
3   222*1   1  M
4   222*2   2  M
5   333*1   1  F
6   333*2   2  F
7   333*3   3  F
8     444   1  wrong sentence used in the topic
9     444   2  same
10    444   3  same
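The expression (~m).cumsum().where(m) is the least obvious step, so here is a minimal sketch of it on a hand-written mask (the mask values are an assumption chosen for illustration). Each unmatched row increments the running sum, so every consecutive run of matched rows ends up with its own label, and where(m) blanks out the unmatched rows.

```python
import pandas as pd

# Illustrative mask: two runs of matched rows separated by two unmatched rows
m = pd.Series([True, True, True, False, False, True, True])

# ~m is True on unmatched rows; the cumulative sum stays constant inside a matched run
group_id = (~m).cumsum().where(m)

# number the rows within each run, starting at 1
counter = m.groupby(group_id).cumcount().add(1)

print(group_id[m].tolist())
print([int(v) for v in counter[m]])
```

Only the values at matched positions are used, which is why the answer always indexes the counter with m before assigning it.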
If you also need to test for duplicates within each Key group:
m = df['comment'].str[:10].duplicated(keep=False) & ~df['comment'].isin(['M','F'])
counter = df.groupby(['Key',(~m).cumsum().where(m)]).cumcount().add(1)
df.loc[m, 'id'] = counter[m]
df.loc[counter.gt(1) & m, 'comment'] = 'same'
print (df)
      Key  id  comment
0     111   1  wrong sentence
1     111   2  same
2     111   3  same
3   222*1   1  M
4   222*2   2  M
5   333*1   1  F
6   333*2   2  F
7   333*3   3  F
8     444   1  wrong sentence used in the topic
9     444   2  same
10    444   3  same
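On the sample data both variants print the same result, because the two runs of matched rows are separated by the M and F rows. A small sketch with made-up data (the keys 'a' and 'b' are assumptions for illustration) shows where they differ: when matched rows of two different keys sit directly next to each other, the first variant keeps counting across the key boundary, while the per-Key variant restarts.

```python
import pandas as pd

# Illustrative data: two adjacent keys whose comments all share the same prefix
df = pd.DataFrame({
    'Key': ['a', 'a', 'b', 'b'],
    'comment': ['wrong sentence', 'wrong sentence', 'wrong sentence', 'wrong sentence'],
})

m = df['comment'].str[:10].duplicated(keep=False) & ~df['comment'].isin(['M', 'F'])
run_id = (~m).cumsum().where(m)

# first variant: one counter per consecutive run of matched rows
global_counter = df.groupby(run_id).cumcount().add(1)
# second variant: the counter also restarts whenever Key changes
per_key_counter = df.groupby(['Key', run_id]).cumcount().add(1)

print(global_counter.tolist())
print(per_key_counter.tolist())
```

With the first variant only row 0 would keep its comment; with the per-Key variant the first row of each key keeps it.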