Check for ending characters in a dataframe and replace them

Question

I would like to add two new columns to my pandas dataframe based on the following conditions:

if a sentence ends with '...', add a new column with value 1, otherwise 0;
if a sentence ends with '...', add a second new column containing the sentence without the '...' at the end.

Something like this:

Text
bla bla bla ...
once upon a time
pretty little liars
Batman ...

Expected

    Text                T    Clean
    bla bla bla ...     1    bla bla bla 
    once upon a time    0    once upon a time 
    pretty little liars 0    pretty little liars
    Batman ...          1    Batman

I tried to apply a regex, but str.endswith would probably be a better approach to check whether a sentence ends with ..., since it returns a boolean value (my T column).

I have tried df['Text'].str.endswith('...'), but I need to create a new column with 1 and 0. For cleaning the text, I would check whether T is true and, if so, remove the ... at the end:

df['Clean'] = df['Text'].str.rstrip('...')

or df['Clean'] = df['Text'].str[:-3] (but it does not include any logical condition or information on ...)

or df['Clean'] = df['Text'].str.replace(r'...$', '')

It is important that I only consider sentences ending with ..., to avoid deleting a ... in the middle of a sentence, where it has a different meaning.

Answer 1

For the first column, I would use the approach you suggested:

df['T'] = df['Text'].str.endswith('...')

(Technically this will create a boolean column, not an integer column. You can use astype(int) to convert if you care about this.)

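For the second column, I would unconditionally replace:

df['Clean'] = df['Text'].str.replace(r'\.\.\.$', '', regex=True)

The dots must be escaped, because an unescaped . in a regex matches any character, so r'...$' would remove the last three characters of every row. regex=True is passed explicitly because recent pandas versions treat the pattern as a literal string by default. If the text doesn't end in ..., the replacement won't do anything.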
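A minimal end-to-end sketch of this approach, assuming a sample frame built from the question's data plus one hypothetical extra row with an ellipsis in the middle. The pattern here also consumes whitespace before the trailing dots (an assumption, so that "Batman ..." becomes "Batman" rather than "Batman ").

```python
import pandas as pd

# Sample data from the question; the last row is a hypothetical addition
# to show that an ellipsis in the middle of a sentence is left alone.
df = pd.DataFrame({
    "Text": [
        "bla bla bla ...",
        "once upon a time",
        "pretty little liars",
        "Batman ...",
        "wait ... what",
    ]
})

# 1 if the sentence ends with a literal "...", otherwise 0.
df["T"] = df["Text"].str.endswith("...").astype(int)

# Remove a trailing "..." (and any whitespace right before it).
# The dots are escaped so they match literal periods only.
df["Clean"] = df["Text"].str.replace(r"\s*\.\.\.$", "", regex=True)

print(df)
```

Rows that do not end in ... pass through unchanged, so no conditional logic is needed for the Clean column.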

Answer 2

In case you want to replace the "ending" ellipsis only on the rows whose text has that property:

df.loc[df['Text'].str.endswith('...') == True, 'ends_in_ellipsis'] = 1

df.loc[df['ends_in_ellipsis'] == 1, 'Text_2'] = df.loc[df['ends_in_ellipsis'] == 1, 'Text'].str.rstrip('...')

Now if you want to do it all in one line (less readable for others, but you avoid a dummy column and the memory it takes up):

df.loc[df['Text'].str.endswith('...') == True, 'Text_2'] = df.loc[df['Text'].str.endswith('...') == True, 'Text'].str.rstrip('...')
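Note that assigning through df.loc on a subset of rows leaves the new column as NaN for every row that does not end in an ellipsis. If the cleaned column should instead keep the original text for those rows, one possible variant is Series.mask, sketched below on the question's sample data. Slicing off the last three characters and trimming whitespace is a choice made here for illustration, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Text": [
        "bla bla bla ...",
        "once upon a time",
        "pretty little liars",
        "Batman ...",
    ]
})

# Boolean mask: True where the sentence ends with a literal "...".
ends = df["Text"].str.endswith("...")

# Where the mask is True, take the text minus its last three characters
# (with trailing whitespace trimmed); elsewhere keep the original text.
df["Text_2"] = df["Text"].mask(ends, df["Text"].str[:-3].str.rstrip())

print(df)
```

Computing the mask once and reusing it also avoids calling endswith twice, as the one-line version does.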

Answer 3

Let us try endswith + rstrip

df['new1']=df.Text.str.endswith('...').astype(int)
df['new2']=df.Text.str.rstrip(' ...') # rstrip only strips trailing spaces and dots, so a ... in the middle is kept
df
                  Text  new1                 new2
0      bla bla bla ...     1          bla bla bla
1     once upon a time     0     once upon a time
2  pretty little liars     0  pretty little liars
3           Batman ...     1               Batman
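One caveat with rstrip: its argument is a set of characters, not a suffix, so ' ...' strips any run of trailing spaces and periods, including a single full stop. The sketch below contrasts it with Series.str.removesuffix, which removes only the exact suffix; it assumes pandas 1.4 or later, where removesuffix is available, and uses hypothetical sample strings.

```python
import pandas as pd

# Hypothetical sample strings chosen to show the difference.
s = pd.Series(["Batman ...", "The end.", "wait..", "a ... b"])

# rstrip treats ' ...' as the character set {' ', '.'}:
# every trailing space or period is removed.
stripped = s.str.rstrip(" ...")

# removesuffix drops only an exact trailing "...";
# the following rstrip() trims whitespace left behind.
suffix_only = s.str.removesuffix("...").str.rstrip()

print(pd.DataFrame({"Text": s, "rstrip": stripped, "removesuffix": suffix_only}))
```

For the question's data both give the same result; they differ only when a sentence ends with one or two periods that should be kept.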

3 answers

Answer 1
2 (accepted) 2020-10-05 19:32:36

Answer 2
1 2020-10-05 19:35:28

Answer 3
0 2020-10-05 19:37:45
