Apply a function to a dataframe column

Question

I have a pandas dataframe:

    name  sample
1   a     Category 1: qwe, asd (line break) Category 2: sdf, erg
2   b     Category 2: sdf, erg (line break) Category 5: zxc, eru
...
30  p     Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err

I want to end up with:

    name  qwe  asd  sdf  erg  zxc  eru  2134  EFDgh  Pdr tke  err
1   a       1    1    1    1    0    0     0      0        0    0
2   b       0    0    1    1    1    1     0      0        0    0
...
30  p       0    1    0    0    0    0     1      1        1    1

I created the following function:

def cleanattributes(istring):

    istring=str(istring)
    istring=istring.rstrip().split('\\n')

    counter=0
    for line in istring:
        istring[counter]=istring[counter].rpartition(': ')[-1]
        counter+=1
    istring=str(istring)
    istring = istring.replace("'", "")
    istring = istring.replace("\"", "")
    return(str(istring))

This function produces the expected result: it returns the category information without the category titles (the idea being to use get_dummies to create the columns).

teststring="Category 1: qwe, asd\\nCategory 2: sdf, erg"
cleanattributes(teststring)
OUTPUT: '[qwe, asd, sdf, erg]'

I'm not sure how best to apply this function to each record so that the dataframe looks like this:

  name    sample
1  a      qwe, asd, sdf, erg
2  b      sdf, erg, zxc, eru
...
30  p      asd, 2134, EFDgh, Pdr tke, err

I'm also not sure whether this is even the best way to approach the problem.

As requested:

df['sample'].iat[0]
OUTPUT: 'Category 1: qwe, asd\nCategory 2: sdf, erg'

Answer 1

import pandas as pd

df = pd.DataFrame(
    {'name': ['a', 'b'],
     'sample': ['Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err',
                'Category 2: sdf, erg\nCategory 5: zxc, eru\nCategory 1: asd, Category PE: 2134, EFDgh, Pdr tke, err']})

# regex=True is passed explicitly because newer pandas versions treat the pattern as a literal by default.
df2 = pd.concat([df.name,
                 df['sample']
                 .str.replace("(Category .*: )+", '', regex=True)  # Remove "Category [*]:"
                 .str.replace(r'\n', '', regex=True)  # Remove "\n"
                 .str.split(', ', expand=True)],
                axis=1)

df3 = pd.melt(df2, id_vars='name')[['name', 'value']]

>>> pd.concat([df3['name'], pd.get_dummies(df3['value'])], axis=1)
   name  2134  EFDgh  Pdr tke  ergzxc  err  eru2134  sdf
0     a     1      0        0       0    0        0    0
1     b     0      0        0       0    0        0    1
2     a     0      1        0       0    0        0    0
3     b     0      0        0       1    0        0    0
4     a     0      0        1       0    0        0    0
5     b     0      0        0       0    0        1    0
6     a     0      0        0       0    1        0    0
7     b     0      1        0       0    0        0    0
8     a     0      0        0       0    0        0    0
9     b     0      0        1       0    0        0    0
10    a     0      0        0       0    0        0    0
11    b     0      0        0       0    1        0    0
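
Note that the output above is in long format (one row per name and item), that the greedy pattern drops "asd" from rows such as 'Category 1: asd, Category PE: ...', and that deleting the line breaks fuses neighbouring items ("ergzxc", "eru2134"). As a possible alternative, here is a minimal sketch that applies a cleaning function to each record with Series.apply and then builds one indicator row per name with Series.str.get_dummies. The helper name clean_attributes and the three sample rows are assumptions modeled on the question, not the asker's actual code or data.

```python
import re
import pandas as pd

# Sample rows modeled on the ones shown in the question (assumed data).
df = pd.DataFrame({
    'name': ['a', 'b', 'p'],
    'sample': [
        'Category 1: qwe, asd\nCategory 2: sdf, erg',
        'Category 2: sdf, erg\nCategory 5: zxc, eru',
        'Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err',
    ],
})


def clean_attributes(text):
    """Hypothetical helper: strip 'Category X: ' titles, return 'a, b, c'."""
    # The label part may not contain ':' or a newline, so each title is
    # removed on its own and 'asd' survives in 'Category 1: asd, Category PE: ...'.
    without_titles = re.sub(r'Category [^:\n]*: ', '', str(text))
    # Treat line breaks and commas alike as item separators.
    items = [item.strip() for item in re.split(r'[\n,]', without_titles)]
    return ', '.join(item for item in items if item)


# Apply the function to every record of the column.
df['sample'] = df['sample'].apply(clean_attributes)

# One indicator column per distinct item, one row per name.
dummies = df['sample'].str.get_dummies(sep=', ')
result = pd.concat([df['name'], dummies], axis=1)
print(result)
```

Replacing line breaks with the same ", " separator, rather than deleting them, is what keeps "erg" and "zxc" apart here. If the real data uses other separators or label formats, the two regular expressions would need adjusting.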

Answer 1: score 2, accepted, 2016-04-05 21:31:47
