
Build list from column names Pandas DataFrame

So I am operating on a rather large set of data. I am using a Pandas DataFrame to handle this data and am stuck on an efficient way to parse the data into two formatted lists.

Here is my DataFrame object:

            fet1    fet2    fet3    fet4    fet5
stim1       True    True    False   False   False
stim2       True    False   False   False   True
stim3       ...................................
stim4       ...................................
stim5       ............................. so on

I am trying to parse each row and create two lists. List one should have the column names of all the True values. List two should have the column names of the False values.

example for stim 1:

list_1=[fet1,fet2]   
list_2=[fet3,fet4,fet5]

I know I can brute-force this and iterate over the rows. Or I can transpose, convert to a dictionary, and parse it that way. I can also create sparse Series objects and then create sets, but then I have to reference the column names separately.

The only problem I am running into is that I always get quadratic, O(n^2), run time.

Is there a more efficient way to do this using built-in Pandas functionality?

Thanks for your help.

Is this what you want?

>>> df
       fet1   fet2   fet3   fet4   fet5
stim1  True   True  False  False  False
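stim2  True  False  False  False   True
>>> def func(row):
...     # row is one stimulus; row.index holds the column names
...     return [
...         row.index[row == True],   # columns whose value is True
...         row.index[row == False],  # columns whose value is False
...     ]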
>>> df.apply(func, axis=1)
stim1    [[fet1, fet2], [fet3, fet4, fet5]]
stim2    [[fet1, fet5], [fet2, fet3, fet4]]
dtype: object

This may or may not be faster. I do not think a more succinct solution is possible.
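
For comparison, here is a minimal sketch that skips apply and loops over the underlying NumPy array instead. It assumes every cell is a plain bool; the names mask, cols, true_cols and false_cols are made up for illustration, and I have not timed it against the version above.

```python
import pandas as pd

# Small frame rebuilt from the question so the snippet runs on its own
df = pd.DataFrame(
    [[True, True, False, False, False],
     [True, False, False, False, True]],
    index=["stim1", "stim2"],
    columns=["fet1", "fet2", "fet3", "fet4", "fet5"],
)

cols = df.columns.to_numpy()      # column names as a NumPy array
mask = df.to_numpy(dtype=bool)    # 2-D boolean array, one row per stimulus

# Boolean indexing picks the column names where the row is True / False
true_cols = {stim: cols[row].tolist() for stim, row in zip(df.index, mask)}
false_cols = {stim: cols[~row].tolist() for stim, row in zip(df.index, mask)}

print(true_cols["stim1"], false_cols["stim1"])
```

This still visits every cell once, which cannot be avoided because each column name has to land in one of the two lists for every row.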

Fast (not row-by-row) operations can get this far.

In [126]: (np.array(df.columns)*~df)[~df]
Out[126]: 
      fet1  fet2  fet3  fet4  fet5
stim1  NaN   NaN  fet3  fet4  fet5
stim2  NaN  fet2  fet3  fet4   NaN

But at this point, because the rows might have variable length, the array structure must be broken and each row must be considered individually.

In [122]: (np.array(df.columns)*df)[df].apply(lambda x: Series([x.dropna()]), 1)
Out[122]: 
                  0
stim1  [fet1, fet2]
stim2  [fet1, fet5]

In [125]: (np.array(df.columns)*~df)[~df].apply(lambda x: Series([x.dropna()]), 1)
Out[125]: 
                    0
stim1  [fet3, fet4, fet5]
stim2  [fet2, fet3, fet4]

The slowest step is probably the Series constructor. I'm pretty sure there's no way around it though.
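
One possible way to avoid constructing a Series per row is to find all the True positions at once with np.nonzero and then cut the flat result into per-row pieces with np.split. This is only a sketch under the assumption that the frame is purely boolean; names_where and the variable names are hypothetical, and the small frame is rebuilt here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[True, True, False, False, False],
     [True, False, False, False, True]],
    index=["stim1", "stim2"],
    columns=["fet1", "fet2", "fet3", "fet4", "fet5"],
)

def names_where(mask, cols):
    """Return one list of column names per row of a 2-D boolean mask."""
    # np.nonzero scans in row-major order, so the hits come out grouped by row
    _, col_idx = np.nonzero(mask)
    # The running count of hits per row gives the positions at which to cut
    split_points = np.cumsum(mask.sum(axis=1))[:-1]
    return [chunk.tolist() for chunk in np.split(cols[col_idx], split_points)]

cols = df.columns.to_numpy()
mask = df.to_numpy(dtype=bool)

list_1 = names_where(mask, cols)    # column names of the True values, per row
list_2 = names_where(~mask, cols)   # column names of the False values, per row
```

The results are plain Python lists in the same order as df.index, so they can be paired back up with zip(df.index, list_1, list_2) if the row labels are needed.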

