Function to clean a Data Frame column with variable text
How can I write a function that cleans a given text passed into a DataFrame? The text will be my variable, so I can pass in any sentence and the function will clean it by applying lower case, removing special characters, etc. My attempt goes like this:
def my_function(x):
    # Applies a few cleaning steps to the exceptions df:
    # Sets text to lower case:
    x.iloc[:, 0].str.lower()
    # Removes breaks:
    x.iloc[:, 0].replace(r'\n', ' ', regex=True)
    # Sets text to lower case:
    x.iloc[:, 0].str.lower()
    # Removes a more extensive set of 'special' characters:
    remove_these = ["!",'"',"%","&","'","(",")","#","*","?",
                    "+",",","-",".","/",":",";","<","=",">",
                    "@","[","\\","]","^","_","`","{","|","}",
                    "~","–","’", "*"]
    for char in remove_these:
        x.iloc[:, 0].str.replace(char, ' ')
    # Removes numbers:
    x.iloc[:, 0].replace(r'\d+', ' ', regex=True)
    # Removes single characters:
    x.iloc[:, 0].replace(r'\b[a-zA-Z]\b', ' ', regex=True)
    # Removes extra spaces (trim) from both ends:
    x.iloc[:, 0].str.strip()
    # Removes double spacing:
    x.iloc[:, 0].replace(r' +', ' ', regex=True)
    # Removes spaces --:
    x.iloc[:, 0].replace(r'--', '', regex=True)
Since the variable text will be passed into a DataFrame, I decided to always use the first column, hence the iloc[:, 0].
Then my variable text would be set like this:
my_variable = "WHAT A WONDERFUL WORLD!"
df_Text = pd.DataFrame({my_variable})
But when I apply this, it doesn't work; the output is 'None':
output = my_function(df_Text)
print(output)
What am I doing wrong? Thanks a lot.
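Your function doesn't actually alter the DataFrame in any way, and it doesn't return anything. Methods such as .str.lower(), .str.replace() and .replace() return a new Series rather than modifying the original, so each result has to be assigned back, and the function needs a return statement; without one, Python returns None.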
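To make the two problems concrete, here is a minimal sketch. The Series s and the helper no_return are made-up names for illustration only; they are not part of the original code.

```python
import pandas as pd

s = pd.Series(["WHAT A WONDERFUL WORLD!"])

# .str.lower() returns a new Series; the result is discarded here,
# so s itself is left untouched.
s.str.lower()
unchanged = s.iloc[0]

# Assigning the result back is what actually keeps the change.
s = s.str.lower()
changed = s.iloc[0]


def no_return(series):
    # Hypothetical helper: it has no return statement,
    # so Python returns None implicitly.
    series.str.lower()


result = no_return(s)

print(unchanged)  # WHAT A WONDERFUL WORLD!
print(changed)    # what a wonderful world!
print(result)     # None
```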
Try this:
import pandas as pd

def my_function(x):
    # Applies a few cleaning steps to the first column of the DataFrame.
    # Each result is assigned back, because these methods return a new Series.
    # Sets text to lower case:
    x.iloc[:, 0] = x.iloc[:, 0].str.lower()
    # Removes line breaks:
    x.iloc[:, 0] = x.iloc[:, 0].replace(r'\n', ' ', regex=True)
    # Sets text to lower case:
    x.iloc[:, 0] = x.iloc[:, 0].str.lower()
    # Removes a more extensive set of 'special' characters:
    remove_these = ["!",'"',"%","&","'","(",")","#","*","?",
                    "+",",","-",".","/",":",";","<","=",">",
                    "@","[","\\","]","^","_","`","{","|","}",
                    "~","–","’", "*"]
    for char in remove_these:
        # regex=False treats each character literally (e.g. '(' or '*'):
        x.iloc[:, 0] = x.iloc[:, 0].str.replace(char, ' ', regex=False)
    # Removes numbers:
    x.iloc[:, 0] = x.iloc[:, 0].replace(r'\d+', ' ', regex=True)
    # Removes single characters:
    x.iloc[:, 0] = x.iloc[:, 0].replace(r'\b[a-zA-Z]\b', ' ', regex=True)
    # Removes extra spaces (trim) from both ends:
    x.iloc[:, 0] = x.iloc[:, 0].str.strip()
    # Collapses repeated spaces into one:
    x.iloc[:, 0] = x.iloc[:, 0].replace(r' +', ' ', regex=True)
    # Removes any remaining double hyphens:
    x.iloc[:, 0] = x.iloc[:, 0].replace(r'--', '', regex=True)
    # Returns the cleaned DataFrame so the caller gets a value instead of None:
    return x
my_variable = "WHAT A WONDERFUL WORLD!"
df_Text = pd.DataFrame({my_variable})
output = my_function(df_Text)
print(output)
                      0
0  what wonderful world
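As a variation on the answer above, the same steps can be chained on the column and the special characters handled with one character class instead of a loop. This is only a sketch under some assumptions: the names clean_text_column, SPECIAL_CHARS and df_text are made up for illustration, the character set mirrors remove_these, and the function returns a cleaned copy rather than modifying the DataFrame that was passed in.

```python
import re

import pandas as pd

# Hypothetical constant: the same characters as remove_these in the answer.
SPECIAL_CHARS = "!\"%&'()#*?+,-./:;<=>@[\\]^_`{|}~–’"


def clean_text_column(df):
    """Return a copy of df whose first column has been cleaned."""
    # re.escape makes every character literal inside the character class.
    pattern = "[" + re.escape(SPECIAL_CHARS) + "]"
    cleaned = (
        df.iloc[:, 0]
        .str.lower()                                 # lower case
        .str.replace(r"\n", " ", regex=True)         # line breaks
        .str.replace(pattern, " ", regex=True)       # special characters
        .str.replace(r"\d+", " ", regex=True)        # numbers
        .str.replace(r"\b[a-z]\b", " ", regex=True)  # single letters
        .str.replace(r" +", " ", regex=True)         # repeated spaces
        .str.strip()                                 # leading/trailing spaces
    )
    out = df.copy()
    out[out.columns[0]] = cleaned
    return out


df_text = pd.DataFrame({"text": ["WHAT A WONDERFUL WORLD!"]})
output = clean_text_column(df_text)
print(output.iloc[0, 0])  # what wonderful world
```

Returning a copy means the original df_text still holds the uncleaned sentence, which can be convenient when comparing before and after.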