
How to use strip on a substring in a DataFrame?

I have a dataset with 100,000 rows and 300 columns.

Here is a sample of the dataset:

    EVENT_DTL
0   8. Background : no job / living with         marriage_virgin 9. Social status : doing pretty well with his family

1   8. Background : Engineer / living with his mom marriage_married

How can I collapse the extra whitespace between 'with' and 'marriage_virgin' so that only a single space remains?

The desired output would be:

        EVENT_DTL
    0   8. Background : no job / living with marriage_virgin 9. Social status : doing pretty well with his family
    
    1   8. Background : Engineer / living with his mom marriage_married

You can use pandas.Series.str.replace to replace "\s+" (one or more whitespace characters) with a single space.

Try this:

df["EVENT_DTL"] = df["EVENT_DTL"].str.replace(r"\s+", " ", regex=True)

Output:

print(df)
                                                                                                   EVENT_DTL
0  8. Background : no job / living with marriage_virgin 9. Social status : doing pretty well with his family
1  8. Background : Engineer / living with his mom marriage_married
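For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the two sample rows from the question. The extra `.str.strip()` at the end is an assumption on my part: it is only needed if a value might also start or end with whitespace.

```python
import pandas as pd

# Sample rows copied from the question; only the first row has extra spaces.
df = pd.DataFrame({"EVENT_DTL": [
    "8. Background : no job / living with         marriage_virgin 9. Social status : doing pretty well with his family",
    "8. Background : Engineer / living with his mom marriage_married",
]})

# Collapse every run of whitespace into one space, then trim both ends.
cleaned = df["EVENT_DTL"].str.replace(r"\s+", " ", regex=True).str.strip()

print(cleaned[0])
```

Rows that contain no repeated whitespace, like the second one, pass through unchanged.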

If you need to clean up the whole DataFrame, use pandas.DataFrame.replace:

# astype(str) returns a new DataFrame, so assign the result back instead of using inplace=True
df = df.astype(str).replace(r"\s+", " ", regex=True)
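Note that `astype(str)` turns every column into text, numeric ones included. If that is not what you want, one possible variant is to touch only the string columns. This is a sketch under assumptions: the `COUNT` column and the shortened sample values are made up for illustration, and `pd.api.types.is_string_dtype` is used to decide which columns count as text.

```python
import pandas as pd

# Hypothetical frame mixing a text column and a numeric column (not from the question).
df = pd.DataFrame({
    "EVENT_DTL": ["living with         marriage_virgin", "living with his mom"],
    "COUNT": [1, 2],
})

# Clean only the columns that hold strings, so numeric columns keep their dtype.
for col in df.columns:
    if pd.api.types.is_string_dtype(df[col]):
        df[col] = df[col].str.replace(r"\s+", " ", regex=True)

print(df.dtypes)
```

With 300 columns this loops once per column, and each column is still processed with a vectorised string operation.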

You can call string methods on a DataFrame column with

df["EVENT_DTL"].str.strip()

but .strip() doesn't help here, because it only removes whitespace from the start and end of each string. To collapse repeated whitespace inside the string, you can use a regular expression:

import re
import pandas as pd

d = {"EVENT_DTL": [
    "8. Background : no job / living with         marriage_virgin 9. Social status : doing pretty well with his family",
    "8. Background : Engineer / living with his mom marriage_married"
]}
df = pd.DataFrame(d)
pattern = re.compile(" +")
df["EVENT_DTL"] = df["EVENT_DTL"].apply(lambda x: pattern.sub(" ", x))
print(df["EVENT_DTL"][0])
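To make the difference concrete, here is a small sketch comparing the two operations on one padded string. The input value is invented for the comparison (nine spaces in the middle, two at each end). Keep in mind that the pattern `" +"` matches plain spaces only, while `"\s+"` would also match tabs and newlines.

```python
import pandas as pd

# Hypothetical value with padding at both ends and a long gap in the middle.
text = "  living with" + " " * 9 + "marriage_virgin  "
s = pd.Series([text])

stripped = s.str.strip()                          # trims the ends, keeps the inner gap
collapsed = s.str.replace(" +", " ", regex=True)  # shrinks every run of spaces to one
both = collapsed.str.strip()                      # collapse first, then trim the ends

print(repr(stripped[0]))
print(repr(collapsed[0]))
print(repr(both[0]))
```

Only the last variant gives a string with single spaces inside and nothing at the ends.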
