
Function to clean a Data Frame column with variable text

How can I write a function that cleans a given text after it has been passed into a Data Frame? The text will be my variable, so I can pass in any sentence and the function will clean it by converting it to lower case, removing special characters, and so on. My attempt goes like this:

def my_function(x):
    # Applies a few cleaning steps to the exceptions df:
    # Sets text to lower case:
    x.iloc[:, 0].str.lower()
    # Removes breaks:
    x.iloc[:, 0].replace(r'\n', ' ', regex=True)
    # Sets text to lower case:
    x.iloc[:, 0].str.lower()
    # Removes a more extensive set of 'special' characters:
    remove_these = ["!",'"',"%","&","'","(",")","#","*","?",
                    "+",",","-",".","/",":",";","<","=",">",
                    "@","[","\\","]","^","_","`","{","|","}",
                    "~","–","’", "*"]
    for char in remove_these:
        x.iloc[:, 0].str.replace(char, ' ')
    # Removes numbers:
    x.iloc[:, 0].replace(r'\d+', ' ', regex=True)
    # Removes single characters:
    x.iloc[:, 0].replace(r'\b[a-zA-Z]\b', ' ', regex=True)
    # Removes extra spaces (trim) from both ends:
    x.iloc[:, 0].str.strip()
    # Removes double spacing:
    x.iloc[:, 0].replace(r' +', ' ', regex=True)
    # Removes spaces --:
    x.iloc[:, 0].replace(r'--', '', regex=True)

Since the variable text will be passed into a DF, I decided to always use the first column, hence the iloc[:, 0].

Then my variable text would be set like this:

my_variable = "WHAT A WONDERFUL WORLD!"
df_Text = pd.DataFrame({my_variable})

But when I apply the function, it doesn't work; the output is 'None':

output = my_function(df_Text)
print(output)

What am I doing wrong? Thanks a lot.

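Your function doesn't actually alter the dataframe in any way, and it doesn't return anything. Each call such as x.iloc[:, 0].str.lower() produces a new Series that is immediately discarded, and a function with no return statement returns None.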
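To see both problems in isolation, here is a minimal sketch. The names s, no_return and with_return are made up for illustration and are not part of the question's code. It shows that a pandas string method hands back a new Series and leaves the original untouched, and that a function without a return statement evaluates to None.

```python
import pandas as pd

s = pd.Series(["WHAT A WONDERFUL WORLD!"])

# .str.lower() builds a new Series; `s` itself is unchanged.
lowered = s.str.lower()
original_after_call = s.iloc[0]    # still upper case
lowered_value = lowered.iloc[0]    # lower-cased copy


def no_return(series):
    series.str.lower()             # result is computed, then thrown away


def with_return(series):
    return series.str.lower()      # result is handed back to the caller


result_without = no_return(s)          # None, because nothing is returned
result_with = with_return(s).iloc[0]   # the lower-cased text

print(original_after_call, lowered_value, result_without, result_with, sep="\n")
```

So the fix needs two things: keep each intermediate result (by assigning it somewhere), and return the final value.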

Try this, which assigns each result back to the column and returns the DataFrame:

import pandas as pd

def my_function(x):
    # Applies a few cleaning steps to the exceptions df:
    # Sets text to lower case:
    x.iloc[:, 0] = x.iloc[:, 0].str.lower()
    # Removes breaks:
    x.iloc[:, 0] = x.iloc[:, 0].replace(r'\n', ' ', regex=True)
    # Removes a more extensive set of 'special' characters:
    remove_these = ["!",'"',"%","&","'","(",")","#","*","?",
                    "+",",","-",".","/",":",";","<","=",">",
                    "@","[","\\","]","^","_","`","{","|","}",
                    "~","–","’", "*"]
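    for char in remove_these:
        # regex=False treats each character literally ("(", "*", "?" etc. are regex metacharacters):
        x.iloc[:, 0] = x.iloc[:, 0].str.replace(char, ' ', regex=False)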
    # Removes numbers:
    x.iloc[:, 0] = x.iloc[:, 0].replace(r'\d+', ' ', regex=True)
    # Removes single characters:
    x.iloc[:, 0] = x.iloc[:, 0].replace(r'\b[a-zA-Z]\b', ' ', regex=True)
    # Removes extra spaces (trim) from both ends:
    x.iloc[:, 0] = x.iloc[:, 0].str.strip()
    # Removes double spacing:
    x.iloc[:, 0] = x.iloc[:, 0].replace(r' +', ' ', regex=True)
    # Removes any remaining double hyphens ('--'):
    x.iloc[:, 0] = x.iloc[:, 0].replace(r'--', '', regex=True)
    return x

my_variable = "WHAT A WONDERFUL WORLD!"
df_Text = pd.DataFrame([my_variable])  # one-row frame built from a list rather than a set literal

output = my_function(df_Text)
print(output)
                      0
0  what wonderful world
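
If you would rather not modify the DataFrame that was passed in, one possible alternative is to clean a single column and return a new Series. This is only a sketch under a few assumptions: clean_text, PUNCT_PATTERN and the column names "text" and "clean" are invented here, and string.punctuation is used in place of the hand-written list (it also contains "$", which the original list omits). The en dash and curly apostrophe from the original list are added explicitly.

```python
import re
import string

import pandas as pd

# ASCII punctuation plus the en dash and curly apostrophe from the original list.
PUNCTUATION = string.punctuation + "–’"
# One character class that matches any of those characters literally.
PUNCT_PATTERN = "[" + re.escape(PUNCTUATION) + "]"


def clean_text(series: pd.Series) -> pd.Series:
    """Return a cleaned copy of a text Series; the input is not modified."""
    return (
        series.str.lower()                                # lower case
        .str.replace(r"\n", " ", regex=True)              # line breaks -> space
        .str.replace(PUNCT_PATTERN, " ", regex=True)      # punctuation -> space
        .str.replace(r"\d+", " ", regex=True)             # numbers -> space
        .str.replace(r"\b[a-z]\b", " ", regex=True)       # single letters -> space
        .str.replace(r" +", " ", regex=True)              # collapse repeated spaces
        .str.strip()                                      # trim both ends
    )


df_text = pd.DataFrame(
    {"text": ["WHAT A WONDERFUL WORLD!", "Line one\nLine 2 -- done."]}
)
df_text["clean"] = clean_text(df_text["text"])
print(df_text["clean"].tolist())
```

Because every step returns a new Series, the calls can be chained and the original "text" column stays as it was. Collapsing spaces before the final strip also means no leftover space at either end.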
