Checking ending characters in dataframe and replacing them
I would like to add two new columns to my pandas dataframe based on the following conditions.
Something like this:
Text
bla bla bla ...
once upon a time
pretty little liars
Batman ...
Expected
Text T Clean
bla bla bla ... 1 bla bla bla
once upon a time 0 once upon a time
pretty little liars 0 pretty little liars
Batman ... 1 Batman
I tried to apply a regex, but str.endswith would probably be a better way to check whether a sentence ends with ..., since it returns a boolean value (my T column).
I have tried the following:
df['Text'].str.endswith('...')
but I would need to create a new column with 1 and 0. To clean the text, I would check whether T is true and, if it is, remove the ... at the end:
df['Clean'] = df['Text'].str.rstrip('...')
or
df['Clean'] = df['Text'].str[:-3]
(but this does not include any logical condition or information on ...)
or
df['Clean'] = df['Text'].str.replace(r'...$', '')
It is important that I only consider sentences ending with ..., to avoid deleting a ... in the middle of a sentence, where it has a different meaning.
For the first column, I would use the approach you suggested:
df['T'] = df['Text'].str.endswith('...')
(Technically, this creates a boolean column, not an integer column. You can use astype(int) to convert it if you care about this.)
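As a minimal sketch of that conversion, assuming the sample data from the question (the dataframe below is rebuilt by hand for illustration):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"Text": ["bla bla bla ...", "once upon a time",
                            "pretty little liars", "Batman ..."]})

# endswith does a literal (non-regex) suffix test and returns booleans;
# astype(int) turns True/False into 1/0
df["T"] = df["Text"].str.endswith("...").astype(int)
print(df["T"].tolist())  # [1, 0, 0, 1]
```

If the column may contain missing values, endswith returns NaN for those rows and astype(int) will fail, so passing na=False to endswith is probably worth considering.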
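For the second column, I would replace unconditionally:
df['Clean'] = df['Text'].str.replace(r'\.\.\.$', '', regex=True)
The dots must be escaped, because an unescaped . matches any character and r'...$' would strip the last three characters of every row. regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default. With the escaped pattern, a row that doesn't end in ... is left unchanged.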
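A small sketch of how the regex behaves, using made-up rows that include an ellipsis in the middle of a sentence. The \s* at the start of the pattern is my own addition to drop the space before the ellipsis. The "Wrong" column is only there to show what a pattern with unescaped dots does:

```python
import pandas as pd

# Illustrative rows; "wait ... what" has an ellipsis in the middle only
df = pd.DataFrame({"Text": ["bla bla bla ...", "once upon a time",
                            "wait ... what", "Batman ..."]})

# Escaped dots match literal periods; \s* also removes whitespace before them
df["Clean"] = df["Text"].str.replace(r"\s*\.\.\.$", "", regex=True)

# Unescaped dots match ANY character, so every row loses its last three characters
df["Wrong"] = df["Text"].str.replace(r"...$", "", regex=True)

print(df["Clean"].tolist())
# ['bla bla bla', 'once upon a time', 'wait ... what', 'Batman']
print(df["Wrong"].tolist())
# ['bla bla bla ', 'once upon a t', 'wait ... w', 'Batman ']
```

Because the pattern is anchored with $, an ellipsis in the middle of the text is never touched.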
If you want to remove the trailing ellipsis only on the rows that actually end with one:
df.loc[df['Text'].str.endswith('...'), 'ends_in_ellipsis'] = 1
df.loc[df['ends_in_ellipsis'] == 1, 'Text_2'] = df.loc[df['ends_in_ellipsis'] == 1, 'Text'].str.rstrip('...')
Rows that do not end in ... are left as NaN in both new columns.
Or, to do it all in one line (less readable for others, but you save a dummy column and the memory it takes up):
df.loc[df['Text'].str.endswith('...'), 'Text_2'] = df.loc[df['Text'].str.endswith('...'), 'Text'].str.rstrip('...')
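A possible variant of the same idea, sketched under the assumption that the untouched rows should keep their original text rather than NaN. It computes the mask once and reuses it; the names mask and trimmed are just illustrative:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"Text": ["bla bla bla ...", "once upon a time",
                            "pretty little liars", "Batman ..."]})

# Compute the condition once and reuse it for both columns
mask = df["Text"].str.endswith("...")
df["T"] = mask.astype(int)

# Drop the last three characters, then any whitespace left before them
trimmed = df["Text"].str[:-3].str.rstrip()

# where() keeps the original text where ~mask is True (no trailing ellipsis)
# and takes the trimmed text for the remaining rows
df["Clean"] = df["Text"].where(~mask, trimmed)
print(df)
```

This keeps the logical condition explicit, which is what the slicing approach in the question was missing.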
Let us try endswith + rstrip:
df['new1'] = df.Text.str.endswith('...').astype(int)
df['new2'] = df.Text.str.rstrip(' ...')  # rstrip only strips trailing characters, so a ... in the middle is kept
df
Text new1 new2
0 bla bla bla ... 1 bla bla bla
1 once upon a time 0 once upon a time
2 pretty little liars 0 pretty little liars
3 Batman ... 1 Batman
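One thing to keep in mind: rstrip treats its argument as a set of characters, not as a literal suffix, so rstrip(' ...') strips any run of trailing spaces and periods. A rough sketch of the difference on made-up strings, with str.removesuffix as an alternative (this assumes pandas 1.4 or later, where that method is available):

```python
import pandas as pd

# Illustrative strings, not from the question
s = pd.Series(["Batman ...", "Version 2.0.", "trailing space  "])

# Character-set semantics: a single trailing period is stripped too
stripped = s.str.rstrip(" .")
print(stripped.tolist())  # ['Batman', 'Version 2.0', 'trailing space']

# removesuffix removes only the exact "..." suffix; rstrip() then drops whitespace
exact = s.str.removesuffix("...").str.rstrip()
print(exact.tolist())  # ['Batman', 'Version 2.0.', 'trailing space']
```

For the data in the question both give the same result; the difference only shows up when a sentence ends in a single period that should be kept.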