How to remove white space from strings in a data frame column?

I am trying to loop through a column in a pandas data frame to remove unnecessary white space at the beginning and end of the strings in that column. My data frame looks like this:

df={'c1': [' ab', 'fg', 'ac ', 'hj-jk ', ' ac', 'df, gh', 'gh', 'ab', 'ad', 'jk-pl', 'ae', 'kl-kl '], 'b2': ['ba', 'bc', 'bd', 'be', 'be', 'be', 'ba'] }


    c1  b2
0   ab, fg
1   ac, hj-jk   
2   ac, df,gh   
3   gh, be
4   ab, be
5   ad, jk-pl
6   ae, kl-kl   

I tried the answer here, but it did not work either. The reason I need to remove the white space from the strings in this column is that I want to one-hot encode the column using the get_dummies() function. My idea was to use the strip() function to remove the white space from each value and then use .str.get_dummies(','):

#function to remove white space from strings
def strip_string(dataframe, column_name):
  for id, item in dataframe[column_name].items():
    a=item.strip()

#removing the white space from the values of the column
strip_string(df, 'c1')

#creating one hot-encoded columns from the values using split(",")

df1=df['c1'].str.get_dummies(',')

But my code returns duplicate columns, which I don't want. I suppose the function that removes the white space is not working properly? Can anyone help? My current output is:

   ab   ac  df  fg  gh  hj-jk   jk-pl   kl-kl   ab  ac  ad  ae  gh
0   1   0   0   1   0   0   0   0   0   0   0   0   0
1   0   0   0   0   0   1   0   0   0   1   0   0   0
2   0   1   1   0   1   0   0   0   0   0   0   0   0
3   0   0   0   0   0   0   0   0   0   0   0   0   1
4   0   0   0   0   0   0   0   0   1   0   0   0   0
5   0   0   0   0   0   0   1   0   0   0   1   0   0
6   0   0   0   0   0   0   0   1   0   0   0   1   0

Columns 'ac' and 'ab' are duplicated. I want to remove the duplicated columns.

UPDATED:

I think you need to handle spaces around commas as well as at the start and end of each string in order for Series.str.get_dummies() to work correctly for your example:

df = df.apply(lambda x: x.str.strip().str.replace(' *, *', ',', regex=True))
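
As a self-contained sketch of the line above, the snippet below rebuilds only the `c1` column from the question (the `b2` list is left out because it has a different length), normalizes the whitespace, and one-hot encodes the result. `regex=True` is passed explicitly because recent pandas versions treat the pattern as a literal string by default. The pattern `\s*,\s*` is a slightly broader variant of `' *, *'`, and the names `cleaned` and `dummies` are illustrative, not from the original answer.

```python
import pandas as pd

# c1 values copied from the question's dictionary
c1 = [' ab', 'fg', 'ac ', 'hj-jk ', ' ac', 'df, gh',
      'gh', 'ab', 'ad', 'jk-pl', 'ae', 'kl-kl ']
df = pd.DataFrame({'c1': c1})

# 1) strip leading/trailing whitespace, 2) drop whitespace around commas
cleaned = df['c1'].str.strip().str.replace(r'\s*,\s*', ',', regex=True)

# split on ',' and build one indicator column per distinct value
dummies = cleaned.str.get_dummies(',')
print(dummies.columns.tolist())
```

With the whitespace normalized first, each label should appear as exactly one column.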

Input:

        c1   b2
0       ab  foo
1       fg  foo
2      ac   foo
3   hj-jk   foo
4       ac  foo
5   df, gh  foo
6       gh  foo
7       ab  foo
8       ad  foo
9    jk-pl  foo
10      ae  foo
11  kl-kl   foo

Intermediate dataframe (after removing spaces at the start and end and adjacent to commas):

       c1   b2
0      ab  foo
1      fg  foo
2      ac  foo
3   hj-jk  foo
4      ac  foo
5   df,gh  foo
6      gh  foo
7      ab  foo
8      ad  foo
9   jk-pl  foo
10     ae  foo
11  kl-kl  foo

Output:

    ab  ac  ad  ae  df  fg  gh  hj-jk  jk-pl  kl-kl
0    1   0   0   0   0   0   0      0      0      0
1    0   0   0   0   0   1   0      0      0      0
2    0   1   0   0   0   0   0      0      0      0
3    0   0   0   0   0   0   0      1      0      0
4    0   1   0   0   0   0   0      0      0      0
5    0   0   0   0   1   0   1      0      0      0
6    0   0   0   0   0   0   1      0      0      0
7    1   0   0   0   0   0   0      0      0      0
8    0   0   1   0   0   0   0      0      0      0
9    0   0   0   0   0   0   0      0      1      0
10   0   0   0   1   0   0   0      0      0      0
11   0   0   0   0   0   0   0      0      0      1

If you just use strip() (as in my earlier answer below), you will get something like this, with a duplicate column for gh:

     gh  ab  ac  ad  ae  df  fg  gh  hj-jk  jk-pl  kl-kl
0     0   1   0   0   0   0   0   0      0      0      0
1     0   0   0   0   0   0   1   0      0      0      0
2     0   0   1   0   0   0   0   0      0      0      0
3     0   0   0   0   0   0   0   0      1      0      0
4     0   0   1   0   0   0   0   0      0      0      0
5     1   0   0   0   0   1   0   0      0      0      0
6     0   0   0   0   0   0   0   1      0      0      0
7     0   1   0   0   0   0   0   0      0      0      0
8     0   0   0   1   0   0   0   0      0      0      0
9     0   0   0   0   0   0   0   0      0      1      0
10    0   0   0   0   1   0   0   0      0      0      0
11    0   0   0   0   0   0   0   0      0      0      1
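
The duplicate comes from the space after the comma in 'df, gh': splitting on ',' yields ' gh' (with a leading space), which is a different label from 'gh'. A minimal sketch reproducing this on two values only (the two-element series `s` is a reduced example of mine, not the full data):

```python
import pandas as pd

s = pd.Series(['df, gh', 'gh'])

# strip() only touches the ends of each whole string,
# so the space inside 'df, gh' survives the split on ','
stripped_only = s.str.strip().str.get_dummies(',')

# repr() makes the leading space in ' gh' visible
print([repr(c) for c in stripped_only.columns])
```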

EARLIER ANSWER:

Either of the following should work:

df = df.applymap(lambda x: x.strip())

... or:

df = df.apply(lambda x: x.str.strip())
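
Both variants assign the result back to `df`, which is the step missing from the `strip_string` function in the question: `item.strip()` returns a new string, and storing it in a local variable leaves the data frame unchanged. Below is a minimal sketch of a corrected helper. It keeps the question's function name, but returning a modified copy is my own design choice, not the asker's code.

```python
import pandas as pd

def strip_string(dataframe, column_name):
    """Return a copy of dataframe with whitespace stripped from one column."""
    result = dataframe.copy()
    # the vectorized .str.strip() replaces the explicit loop,
    # and the assignment writes the stripped values back
    result[column_name] = result[column_name].str.strip()
    return result

df = pd.DataFrame({'c1': [' ab', 'ac ', 'hj-jk ']})
df = strip_string(df, 'c1')
print(df['c1'].tolist())
```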

I would stack, strip, get_dummies, and groupby.max.

If the separator is ', ':

df.stack().str.strip().str.get_dummies(sep=', ').groupby(level=0).max()

Otherwise:

df.stack().str.replace(r'\s', '', regex=True).str.get_dummies(sep=',').groupby(level=0).max()

Output:

   ab  ac  ba  bc  bd  be  df  fg  gh  hj-jk
0   1   0   1   0   0   0   0   0   0      0
1   0   0   0   1   0   0   0   1   0      0
2   0   1   0   0   1   0   0   0   0      0
3   0   0   0   0   0   1   0   0   0      1
4   0   1   0   0   0   1   0   0   0      0
5   0   0   0   0   0   1   1   0   1      0
6   0   0   1   0   0   0   0   0   1      0
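
The output above is consistent with a 7-row frame, so the sketch below assumes `c1` is truncated to its first seven values to match the length of `b2` (the dictionary in the question has lists of unequal length). It runs the second variant end to end. The name `out` is illustrative.

```python
import pandas as pd

# Assumption: c1 cut to 7 values so both columns have equal length
df = pd.DataFrame({
    'c1': [' ab', 'fg', 'ac ', 'hj-jk ', ' ac', 'df, gh', 'gh'],
    'b2': ['ba', 'bc', 'bd', 'be', 'be', 'be', 'ba'],
})

out = (
    df.stack()                           # one long Series indexed by (row, column)
      .str.replace(r'\s', '', regex=True)  # drop all whitespace
      .str.get_dummies(sep=',')          # indicator columns per value
      .groupby(level=0)                  # regroup by original row label
      .max()                             # 1 if the value occurs in either column
)
print(out)
```

Stacking first means values from `c1` and `b2` share one set of indicator columns, and `groupby(level=0).max()` collapses the two stacked entries of each row back into a single row.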

See if this helps:

import pandas as pd

data = {'c1': [' ab ', 'fg', 'ac ', 'hj-jk '], 'b2': ['ba', 'bc', 'bd', 'be']}
df = pd.DataFrame(data)
print(df.head())
# apply str.strip to every value of every column
df = df.apply(lambda x: x.map(str.strip))
print(df.head())

