Including word boundary in string modification to be more specific

Background

The following is a minor change from the earlier question "modification of skipping empty list and continuing with function".

import pandas as pd

Names = [['ann'],
         [],
         ['elisabeth', 'lis'],
         ['his', 'he'],
         []]
df = pd.DataFrame({'Text': ['ann had an anniversery today',
                            'nothing here',
                            'I like elisabeth and lis 5 lists ',
                            'one day he and his cheated',
                            'same here'],
                   'P_ID': [1, 2, 3, 4, 5],
                   'P_Name': Names})

# rearrange columns
df = df[['Text', 'P_ID', 'P_Name']]
df
                  Text                P_ID  P_Name
0   ann had an anniversery today        1   [ann]
1   nothing here                        2   []
2   I like elisabeth and lis 5 lists    3   [elisabeth, lis]
3   one day he and his cheated          4   [his, he]
4   same here                           5   []

The code below works:

m = df['P_Name'].str.len().ne(0)
df.loc[m, 'New'] = df.loc[m, 'Text'].replace(df.loc[m].P_Name,'**BLOCK**',regex=True) 

It does the following:

1) uses the names in P_Name to block the corresponding text in the Text column by replacing it with **BLOCK**

2) produces a new column, New

The result is shown below:

   Text  P_ID P_Name  New
0                     **BLOCK** had an **BLOCK**iversery today
1                     NaN
2                     I like **BLOCK** and **BLOCK** 5 **BLOCK**ts
3                     one day **BLOCK** and **BLOCK** c**BLOCK**ated
4                     NaN

Problem

However, this code works a little "too well."

Using ['his','he'] from P_Name to block Text:

Example: one day he and his cheated becomes one day **BLOCK** and **BLOCK** c**BLOCK**ated

Desired: one day he and his cheated becomes one day **BLOCK** and **BLOCK** cheated

In this example, I would like cheated to stay as cheated and not become c**BLOCK**ated
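The over-matching described above can be reproduced with the standard re module alone, which may make the cause easier to see. This is only a minimal sketch using the sentence from the example; the variable names are illustrative and not part of the original code.

```python
import re

text = "one day he and his cheated"

# Without anchors, "he" also matches the letters "he" inside "cheated".
unanchored = re.sub(r"his|he", "**BLOCK**", text)

# With \b on both sides, a name only matches when it stands as a whole word.
anchored = re.sub(r"\b(?:his|he)\b", "**BLOCK**", text)

print(unanchored)
print(anchored)
```

The first result contains c**BLOCK**ated, while the second leaves cheated untouched.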

Desired Output

    Text P_ID P_Name  New
0                     **BLOCK** had an anniversery today
1                     NaN
2                     I like **BLOCK** and **BLOCK** 5 lists
3                     one day **BLOCK** and **BLOCK** cheated
4                     NaN

Question

How do I achieve my desired output?

You need to add a word boundary (\b) to both sides of each string in the lists of df.loc[m].P_Name, as follows:

s = df.loc[m].P_Name.map(lambda x: [r'\b'+item+r'\b' for item in x])

Out[71]:
0                   [\bann\b]
2    [\belisabeth\b, \blis\b]
3           [\bhis\b, \bhe\b]
Name: P_Name, dtype: object

df.loc[m, 'Text'].replace(s, '**BLOCK**',regex=True)

Out[72]:
0       **BLOCK** had an anniversery today
2    I like **BLOCK** and **BLOCK** 5 lists
3    one day **BLOCK** and **BLOCK** cheated
Name: Text, dtype: object
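If the names might contain regex metacharacters such as a dot, it may be safer to escape them before adding the boundaries. The sketch below is one possible variant in plain Python, not the answer's own code: block_names is a hypothetical helper, and the two lists simply mirror the Text and P_Name columns from the question.

```python
import re

def block_names(text, names, placeholder="**BLOCK**"):
    """Replace whole-word occurrences of any name; return None when names is empty."""
    if not names:
        return None
    # re.escape makes each name match literally; \b keeps matches to whole words.
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b"
    return re.sub(pattern, placeholder, text)

# Data copied from the question's Text and P_Name columns.
texts = ["ann had an anniversery today",
         "nothing here",
         "I like elisabeth and lis 5 lists ",
         "one day he and his cheated",
         "same here"]
names = [["ann"], [], ["elisabeth", "lis"], ["his", "he"], []]

result = [block_names(t, n) for t, n in zip(texts, names)]
for row in result:
    print(row)
```

Under the same assumptions, the helper could be applied to the DataFrame with df['New'] = [block_names(t, n) for t, n in zip(df['Text'], df['P_Name'])].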

Sometimes a for loop is good practice:

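# split each text into words and replace the words that exactly equal a name
df['New'] = [pd.Series(x).replace(dict.fromkeys(y, '**PHI**')).str.cat(sep=' ') for x, y in zip(df.Text.str.split(), df.P_Name)]
# keep the result only where P_Name is non-empty; assigning back avoids relying on
# inplace=True, which may not write through to df in newer pandas versions
df['New'] = df['New'].where(df.P_Name.astype(bool))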
df
                                Text  ...                                  New
0       ann had an anniversery today  ...     **PHI** had an anniversery today
1                       nothing here  ...                                  NaN
2  I like elisabeth and lis 5 lists   ...   I like **PHI** and **PHI** 5 lists
3         one day he and his cheated  ...  one day **PHI** and **PHI** cheated
4                          same here  ...                                  NaN
[5 rows x 4 columns]
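One difference between this whitespace-splitting approach and the \b regex is how punctuation is handled: a token such as "he," is not equal to "he", so it is left alone. The sketch below illustrates this in plain Python; the sentence is made up for the illustration and does not come from the question's data.

```python
import re

# Hypothetical sentence with a name directly followed by a comma.
text = "he said his plan, not he, cheated"
names = ["his", "he"]

# Token-based: a word is replaced only if the whole whitespace-separated token equals a name.
token_based = " ".join("**PHI**" if tok in names else tok for tok in text.split())

# Regex-based: \b also treats the position before the comma as a word boundary.
pattern = r"\b(?:" + "|".join(map(re.escape, names)) + r")\b"
regex_based = re.sub(pattern, "**PHI**", text)

print(token_based)
print(regex_based)
```

Which behavior is preferable depends on whether names next to punctuation should be blocked as well.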
