How to remove white space from strings of a data frame column?
I am trying to loop through a column in a pandas data frame to remove unnecessary white space at the beginning and end of the strings within the column. My data frame looks like this:
df={'c1': [' ab', 'fg', 'ac ', 'hj-jk ', ' ac', 'df, gh', 'gh', 'ab', 'ad', 'jk-pl', 'ae', 'kl-kl '], 'b2': ['ba', 'bc', 'bd', 'be', 'be', 'be', 'ba'] }
c1 b2
0 ab, fg
1 ac, hj-jk
2 ac, df,gh
3 gh, be
4 ab, be
5 ad, jk-pl
6 ae, kl-kl
I tried this answer here, but it did not work either. The reason I need to remove the white space from the strings in this column is that I want to one-hot encode this column using the get_dummies() function. My idea was to use the strip() function to remove the white space from each value and then use .str.get_dummies(','):
#function to remove white space from strings
def strip_string(dataframe, column_name):
    for id, item in dataframe[column_name].items():
        a=item.strip()
#removing the white space from the values of the column
strip_string(df, 'c1')
#creating one hot-encoded columns from the values using split(",")
df1=df['c1'].str.get_dummies(',')
But my code returns duplicate columns, and I don't want this. I suppose the function to remove the white space is not working well? Can anyone help? My current output is:
ab ac df fg gh hj-jk jk-pl kl-kl ab ac ad ae gh
0 1 0 0 1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 1 0 0 0 1 0 0 0
2 0 1 1 0 1 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 1
4 0 0 0 0 0 0 0 0 1 0 0 0 0
5 0 0 0 0 0 0 1 0 0 0 1 0 0
6 0 0 0 0 0 0 0 1 0 0 0 1 0
Columns 'ac' and 'ab' are duplicated. I want to remove the duplicated columns.
UPDATED:
I think you need to handle spaces around commas as well as at the start/end of a string in order for Series.str.get_dummies() to work correctly for your example:
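# regex=True is required in pandas 2.0+, where str.replace treats the pattern as a literal by default
df = df.apply(lambda x: x.str.strip().str.replace(' *, *', ',', regex=True))
Input: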
c1 b2
0 ab foo
1 fg foo
2 ac foo
3 hj-jk foo
4 ac foo
5 df, gh foo
6 gh foo
7 ab foo
8 ad foo
9 jk-pl foo
10 ae foo
11 kl-kl foo
Intermediate dataframe (after removing spaces at start and end and adjacent to commas):
c1 b2
0 ab foo
1 fg foo
2 ac foo
3 hj-jk foo
4 ac foo
5 df,gh foo
6 gh foo
7 ab foo
8 ad foo
9 jk-pl foo
10 ae foo
11 kl-kl foo
Output:
ab ac ad ae df fg gh hj-jk jk-pl kl-kl
0 1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 1 0 0 0 0
2 0 1 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 1 0 0
4 0 1 0 0 0 0 0 0 0 0
5 0 0 0 0 1 0 1 0 0 0
6 0 0 0 0 0 0 1 0 0 0
7 1 0 0 0 0 0 0 0 0 0
8 0 0 1 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 1 0
10 0 0 0 1 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 1
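The steps above can be put together in one runnable sketch. It assumes the 12 values of c1 from the question and a filler column b2 of 'foo', as in the input table; the names cleaned and dummies are only illustrative.

```python
import pandas as pd

# Sample data from the question; the 'foo' filler for b2 is an assumption.
df = pd.DataFrame({
    "c1": [" ab", "fg", "ac ", "hj-jk ", " ac", "df, gh",
           "gh", "ab", "ad", "jk-pl", "ae", "kl-kl "],
    "b2": ["foo"] * 12,
})

# Trim both ends of every string, then drop spaces on either side of a comma.
cleaned = df.apply(
    lambda col: col.str.strip().str.replace(r" *, *", ",", regex=True)
)

# One-hot encode c1, splitting multi-valued cells on ','.
dummies = cleaned["c1"].str.get_dummies(",")
print(dummies.columns.tolist())
```

Each distinct value should now appear as exactly one column.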
If you just use strip() (as in my earlier answer below), you will get something like this, with a duplicate for gh:
gh ab ac ad ae df fg gh hj-jk jk-pl kl-kl
0 0 1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 1 0 0 0 0
2 0 0 1 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 1 0 0
4 0 0 1 0 0 0 0 0 0 0 0
5 1 0 0 0 0 1 0 0 0 0 0
6 0 0 0 0 0 0 0 1 0 0 0
7 0 1 0 0 0 0 0 0 0 0 0
8 0 0 0 1 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 1 0
10 0 0 0 0 1 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 1
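The duplicate comes from the space after the comma: get_dummies splits on the separator but does not trim the pieces, so ' gh' and 'gh' count as different labels. A small sketch on a two-value series (the variable names are illustrative) shows the difference:

```python
import pandas as pd

s = pd.Series(["df, gh", "gh"])

# strip() only touches the ends of each string, so ' gh' survives the split.
strip_only = s.str.strip().str.get_dummies(",")

# Removing spaces around the comma first leaves a single 'gh' label.
normalized = (
    s.str.strip()
     .str.replace(r" *, *", ",", regex=True)
     .str.get_dummies(",")
)

print(strip_only.columns.tolist())
print(normalized.columns.tolist())
```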
EARLIER ANSWER:
Either of the following should work:
df = df.applymap(lambda x: x.strip())
... or:
df = df.apply(lambda x: x.str.strip())
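Both variants call a string method on every column, so they assume the whole frame holds strings; a numeric column would raise an AttributeError. A hedged sketch for a mixed frame, assuming a hypothetical numeric column n that is not in the question:

```python
import pandas as pd
from pandas.api.types import is_string_dtype

# 'n' is a made-up numeric column used only to illustrate the guard.
df = pd.DataFrame({"c1": [" ab", "fg "], "n": [1, 2]})

# Strip only the columns that hold strings; pass the others through untouched.
cleaned = df.apply(
    lambda col: col.str.strip() if is_string_dtype(col) else col
)
print(cleaned)
```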
I would stack, strip, get_dummies, and groupby.max:
If the separator is ', ':
df.stack().str.strip().str.get_dummies(sep=', ').groupby(level=0).max()
Otherwise:
df.stack().str.replace(r'\s', '', regex=True).str.get_dummies(sep=',').groupby(level=0).max()
Output:
ab ac ba bc bd be df fg gh hj-jk
0 1 0 1 0 0 0 0 0 0 0
1 0 0 0 1 0 0 0 1 0 0
2 0 1 0 0 1 0 0 0 0 0
3 0 0 0 0 0 1 0 0 0 1
4 0 1 0 0 0 1 0 0 0 0
5 0 0 0 0 0 1 1 0 1 0
6 0 0 1 0 0 0 0 0 1 0
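As a runnable sketch of the second variant: the data below is an assumption, taking the first seven values of c1 so that both columns have the same length as b2 and match the seven-row output above.

```python
import pandas as pd

# Assumed input: c1 truncated to seven values to match the length of b2.
df = pd.DataFrame({
    "c1": [" ab", "fg", "ac ", "hj-jk ", " ac", "df, gh", "gh"],
    "b2": ["ba", "bc", "bd", "be", "be", "be", "ba"],
})

result = (
    df.stack()                               # one long Series indexed by (row, column)
      .str.replace(r"\s", "", regex=True)    # drop every white-space character
      .str.get_dummies(sep=",")              # one indicator column per label
      .groupby(level=0).max()                # merge c1 and b2 indicators per row
)
print(result)
```

Because the frame is stacked first, labels from c1 and b2 share one set of indicator columns, and groupby(level=0).max() collapses them back to one row per original row.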
See if this helps:
import pandas as pd
data = {'c1': [' ab ', 'fg', 'ac ', 'hj-jk '], 'b2': ['ba', 'bc', 'bd', 'be']}
df = pd.DataFrame(data)
print(df.head())
# strip leading/trailing white space from every cell (all columns hold strings)
df = df.apply(lambda x: x.map(str.strip))
print(df.head())
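As for why the strip_string function in the question has no effect: it stores each stripped value in a local variable and never writes it back to the data frame. A minimal sketch of a version that returns the cleaned frame instead; strip_column is a hypothetical name, and the four-row data is a shortened sample:

```python
import pandas as pd

def strip_column(dataframe: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """Return a copy with white space trimmed in one column (hypothetical helper)."""
    result = dataframe.copy()
    result[column_name] = (
        result[column_name]
        .str.strip()                                # trim both ends
        .str.replace(r"\s*,\s*", ",", regex=True)   # trim around commas
    )
    return result

df = pd.DataFrame({"c1": [" ab", "fg", "ac ", "df, gh"]})
cleaned = strip_column(df, "c1")   # the result must be assigned to be used
dummies = cleaned["c1"].str.get_dummies(",")
print(dummies.columns.tolist())
```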