How to use strip on a substring in a dataframe?

Question

I have a dataset with 100,000 rows and 300 columns.

Here is the sample dataset:

    EVENT_DTL
0   8. Background : no job / living with         marriage_virgin 9. Social status : doing pretty well with his family

1   8. Background : Engineer / living with his mom marriage_married

How can I remove the extra whitespace between 'with' and 'marriage_virgin' so that only a single space is left?

The desired output would be:

        EVENT_DTL
    0   8. Background : no job / living with marriage_virgin 9. Social status : doing pretty well with his family
    
    1   8. Background : Engineer / living with his mom marriage_married

Answer 1

You can use pandas.Series.str.replace with the regex "\s+" (one or more whitespace characters) to collapse each run of whitespace into a single space.

Try this:

df["EVENT_DTL"] = df["EVENT_DTL"].str.replace(r"\s+", " ", regex=True)

Output:

print(df)
                                                                                                   EVENT_DTL
0  8. Background : no job / living with marriage_virgin 9. Social status : doing pretty well with his family
1  8. Background : Engineer / living with his mom marriage_married
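For reference, here is a minimal self-contained sketch of the same approach, built on the two sample rows from the question. The trailing .str.strip() is an addition, not part of the answer; it only trims a leading or trailing space that may remain after collapsing.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({"EVENT_DTL": [
    "8. Background : no job / living with         marriage_virgin 9. Social status : doing pretty well with his family",
    "8. Background : Engineer / living with his mom marriage_married",
]})

# Collapse every run of whitespace into one space, then trim both ends.
df["EVENT_DTL"] = (
    df["EVENT_DTL"]
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

print(df["EVENT_DTL"][0])
```

Note that "\s+" also matches tabs and newlines, so those are turned into single spaces as well.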

If you need to clean up the whole dataframe, use pandas.DataFrame.replace and assign the result back (inplace=True on the temporary copy returned by astype would leave df unchanged):

df = df.astype(str).replace(r"\s+", " ", regex=True)
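Casting the whole frame with astype(str) also turns numeric columns into strings, which may not be wanted with 300 columns. A possible variation, sketched under the assumption that only text columns need cleaning, applies the replacement to string columns only. The AGE column and the text_cols name are hypothetical and used purely for illustration.

```python
import pandas as pd

# Hypothetical frame mixing a text column with a numeric one.
df = pd.DataFrame({
    "EVENT_DTL": ["living with         marriage_virgin", "his mom   marriage_married"],
    "AGE": [31, 45],
})

# Pick out the columns that hold strings so numeric dtypes stay untouched.
text_cols = [c for c in df.columns if pd.api.types.is_string_dtype(df[c])]

# Collapse runs of whitespace in the text columns only.
df[text_cols] = df[text_cols].replace(r"\s+", " ", regex=True)

print(df.dtypes)
```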

Answer 2

You can call string methods on a DataFrame column with

df["EVENT_DTL"].str.strip()

but .strip() doesn't help here, because it only removes whitespace from the start and end of the string. To remove all repeated spaces you can use a regex:

import re
import pandas as pd

# Rebuild the sample data from the question.
d = {"EVENT_DTL": [
    "8. Background : no job / living with         marriage_virgin 9. Social status : doing pretty well with his family",
    "8. Background : Engineer / living with his mom marriage_married"
]}
df = pd.DataFrame(d)
# Compile once: matches one or more consecutive space characters.
pattern = re.compile(" +")
# Replace each run of spaces with a single space, row by row.
df["EVENT_DTL"] = df["EVENT_DTL"].apply(lambda x: pattern.sub(" ", x))
print(df["EVENT_DTL"][0])
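To make the difference between .strip() and the regex concrete, the following small sketch compares both on one value. The padded sample string and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical value with padding at both ends and a long gap inside.
s = pd.Series(["  living with         marriage_virgin  "])

# .str.strip() trims only the two ends; the inner gap survives.
stripped = s.str.strip()

# The regex collapses the inner gap; .str.strip() then trims the ends.
collapsed = s.str.replace(r" +", " ", regex=True).str.strip()

print(repr(stripped[0]))
print(repr(collapsed[0]))
```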

Answer 1
4 Accepted 2022-12-01 10:52:46

Answer 2
1 2022-12-01 10:52:59