
How to append columns based on other column values to pandas dataframe

I have the following problem: I want to append columns to a dataframe. The new columns are the unique values found in another column of this dataframe, and each is filled with the number of occurrences of that value in each row. It looks like this:

df:

   Column1  Column2
0     1       a,b,c
1     2       a,e
2     3       a
3     4       c,f
4     5       c,f

What I am trying to get is:

    Column1  Column2  a  b  c  e  f
0     1       a,b,c   1  1  1
1     2       a,e     1        1
2     3       a       1
3     4       c,f           1     1
4     5       c,f           1     1

(the empty spaces can be nan or 0, it matters not.)

I have now written some code to achieve this, but instead of appending columns, it appends rows, so that my output looks like this:

        Column1  Column2
    0     1       a,b,c
    1     2       a,e
    2     3       a
    3     4       c,f
    4     5       c,f
    a     1        1
    b     1        1
    c     1        1
    e     1        1
    f     1        1

The code looks like this:

def NewCols(x):
    for i, value in df['Column2'].iteritems():
        listi=value.split(',')
        for value in listi:
            string = value
            x[string]=list.count(string)
    return x

df1=df.apply(NewCols)

What I am trying to do here is to iterate through each row of the dataframe and split the string (a,b,c) contained in Column2 at the commas, so the variable listi is then a list containing the separated string values. For each of these values I then want to make a new column and fill it with the number of occurrences of that value in listi. I am confused why the code appends rows instead of columns. Does somebody know why, and how I can correct that?

While we could do this using get_dummies , we can also cheat and use pd.value_counts directly:

>>> df = pd.DataFrame({'Column1': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'Column2': {0: 'a,b,c', 1: 'a,e', 2: 'a', 3: 'c,f', 4: 'c,f'}})
>>> df.join(df.Column2.str.split(",").apply(pd.value_counts).fillna(0))
   Column1 Column2  a  b  c  e  f
0        1   a,b,c  1  1  1  0  0
1        2     a,e  1  0  0  1  0
2        3       a  1  0  0  0  0
3        4     c,f  0  0  1  0  1
4        5     c,f  0  0  1  0  1

Step-by-step, we have

>>> df.Column2.str.split(",")
0    [a, b, c]
1       [a, e]
2          [a]
3       [c, f]
4       [c, f]
dtype: object
>>> df.Column2.str.split(",").apply(pd.value_counts)
    a   b   c   e   f
0   1   1   1 NaN NaN
1   1 NaN NaN   1 NaN
2   1 NaN NaN NaN NaN
3 NaN NaN   1 NaN   1
4 NaN NaN   1 NaN   1
>>> df.Column2.str.split(",").apply(pd.value_counts).fillna(0)
   a  b  c  e  f
0  1  1  1  0  0
1  1  0  0  1  0
2  1  0  0  0  0
3  0  0  1  0  1
4  0  0  1  0  1
>>> df.join(df.Column2.str.split(",").apply(pd.value_counts).fillna(0))
   Column1 Column2  a  b  c  e  f
0        1   a,b,c  1  1  1  0  0
1        2     a,e  1  0  0  1  0
2        3       a  1  0  0  0  0
3        4     c,f  0  0  1  0  1
4        5     c,f  0  0  1  0  1
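As mentioned above, str.get_dummies is the other route. A brief sketch (not part of the original answer): it splits on the given separator and returns one integer indicator column per unique value. Note it yields 0/1 flags rather than counts, which is equivalent here since each value appears at most once per row, and it avoids the float dtype that fillna(0) leaves behind:

```python
import pandas as pd

df = pd.DataFrame({'Column1': [1, 2, 3, 4, 5],
                   'Column2': ['a,b,c', 'a,e', 'a', 'c,f', 'c,f']})

# One 0/1 indicator column per unique comma-separated value,
# already integer-typed.
dummies = df['Column2'].str.get_dummies(sep=',')
result = df.join(dummies)
```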

When you use apply , it calls your function once for each column, with that column as an argument. So x in your NewCols will be set to a single column. When you do x[string] = list.count(string) , you are adding values to that column. Since apply is called for each column, you wind up appending the values to both columns in this way.
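This per-column behaviour is easy to confirm with a small probe (an illustrative snippet, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({'Column1': [1, 2], 'Column2': ['a,b', 'a']})

seen = []
def probe(x):
    # With the default axis=0, apply hands the function one whole
    # column (a Series) at a time, so x.name is the column label.
    seen.append(x.name)
    return x

df.apply(probe)
# seen now records that probe ran once per column, not per row.
```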

apply is not the right choice when your computation depends only on the values of a single column. Instead, use map . In this case, what you need to do is write a NewCol function that accepts a single Column2 value and returns the data for a single row. You can return this as a dict, or, handily, a dict-like object such as a collections.Counter . Then you need to wrap this new row data into a DataFrame and attach it column-wise to your existing data using concat . Here is an example:

import collections

def NewCols(val):
    return collections.Counter(val.split(','))

>>> pandas.concat([df, pandas.DataFrame.from_records(df.Column2.map(NewCols))], axis=1)
   Column1 Column2   a   b   c   e   f
0        1   a,b,c   1   1   1 NaN NaN
1        2     a,e   1 NaN NaN   1 NaN
2        3       a   1 NaN NaN NaN NaN
3        4     c,f NaN NaN   1 NaN   1
4        5     c,f NaN NaN   1 NaN   1

For this particular computation, you actually don't need to write your own function at all, because pandas has split built in as an operation under the .str method accessor. So you can do this:

>>> pandas.concat([df, pandas.DataFrame.from_records(df.Column2.str.split(',').map(collections.Counter))], axis=1)
   Column1 Column2   a   b   c   e   f
0        1   a,b,c   1   1   1 NaN NaN
1        2     a,e   1 NaN NaN   1 NaN
2        3       a   1 NaN NaN NaN NaN
3        4     c,f NaN NaN   1 NaN   1
4        5     c,f NaN NaN   1 NaN   1
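If zeros are preferred over NaN (the question accepts either), the concatenated result can be filled and cast back to integers. A small sketch building on the Counter approach above:

```python
import collections
import pandas as pd

df = pd.DataFrame({'Column1': [1, 2, 3, 4, 5],
                   'Column2': ['a,b,c', 'a,e', 'a', 'c,f', 'c,f']})

counts = pd.DataFrame.from_records(
    df['Column2'].str.split(',').map(collections.Counter))
# Replace the NaN gaps with 0 and restore the integer dtype
# that the NaN-holding float columns lost.
result = pd.concat([df, counts.fillna(0).astype(int)], axis=1)
```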

You could use something as:

import pandas as pd
import sklearn.feature_extraction.text

vect = sklearn.feature_extraction.text.CountVectorizer(binary=True, token_pattern=u'(?u)\\b\\w+\\b')
df = ...
v = [a for a in df['Column2']]
new_df = df.combine_first(pd.DataFrame(vect.fit_transform(v).todense(), columns=vect.get_feature_names()))
print(new_df)

Cheers!
