How to create a DataFrame column listing the headings that meet a condition, apply a cap, and then record the headings excluded by the cap

I'm struggling to solve this issue. Help would be very much appreciated.

Note: bold in the text refers to the columns I need to create.

I have a data set in which I count the values in each row that are not nan; the result is stored in column [count]. In column [incl_count] I would like lists identifying the headings of the columns that contribute to the count. Next, I would like a limit column [lim] that is capped at a maximum of 3 counts. This means that the last columns to join the count cannot be considered and are therefore excluded, with the excluded headings saved in column [excl].
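To make the first two columns concrete, here is a minimal sketch of how [count] and [incl_count] could be derived, assuming the value columns are the only columns in the frame. The three-row sample and the names `present` and `incl` are illustrative only and not part of the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample mirroring the layout described above.
df = pd.DataFrame(
    {
        "A": [np.nan, -0.01, -0.02],
        "B": [np.nan, np.nan, -0.04],
        "C": [np.nan, np.nan, 0.02],
    },
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05"]),
)

# Boolean frame: True wherever a value is present (not nan).
present = df.notna()

# Headings of the columns that contribute to each row's count.
incl = [list(present.columns[mask]) for mask in present.to_numpy()]

df["count"] = present.sum(axis=1)
df["incl_count"] = pd.Series(incl, index=df.index)
```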

[index]     [A]   [B]   [C]    [D]    [E]    [F]  [count] [incl_count]    [lim]  [excl]
   ...
   ...
   ...

2020-01-01  nan    nan   nan   nan    nan    nan     0      []             0       []
2020-01-02 -0.01   nan   nan   nan    nan    nan     1      [A]            1       []
2020-01-03  0.02   nan   nan   nan    nan    nan     1      [A]            1       []
2020-01-04 -0.01   0.01  nan   nan    nan    nan     2      [A,B]          2       []
2020-01-05 -0.02  -0.04  0.02  nan    nan    nan     3      [A,B,C]        3       []
2020-01-06  nan    0.02  0.03  0.02   0.01   nan     4      [B,C,D,E]      3       [E]
2020-01-07  nan   -0.02  0.01  -0.01  0.03   0.01    5      [B,C,D,E,F]    3       [E,F]
2020-01-08  nan    nan  -0.02  0.05   -0.05  0.02    4      [C,D,E,F]      2       [E,F]
2020-01-09  nan    nan   nan   0.02   0.02   0.05    3      [D,E,F]        1       [E,F]
2020-01-10  nan    nan   nan    nan   nan    0.01    1      [F]            0       [F]
   ...
   ...
   ...

This should work:

import pandas as pd

# Columns that hold metadata or results rather than values to check.
non_value_columns = ["index", "incl_count", "excl", "lim", "count"]
max_lim = 3  # maximum number of columns allowed to count per row

df = pd.read_excel('your.xlsx')
entries = [col for col in df.columns if col not in non_value_columns]

cur_excludes = set()  # a column excluded on an earlier row stays excluded
lims, incl_lists, excl_lists = [], [], []
for _, row in df.iterrows():
    c = 0
    incl = []
    excl = []
    for entry in entries:
        if pd.notna(row[entry]):
            incl.append(entry)
            # Exclude the column if it was excluded before or the cap is reached.
            if entry in cur_excludes or c >= max_lim:
                excl.append(entry)
                cur_excludes.add(entry)
            else:
                c += 1
    lims.append(c)
    incl_lists.append(str(incl))
    excl_lists.append(str(excl))

df['lim'] = lims
df['incl_count'] = incl_lists
df['excl'] = excl_lists
df.to_excel('output.xlsx')

Edit: Changed the code so that it loops through all the columns. I added a list, non_value_columns, where you can name the columns that do not hold values. It is name-based, so if you add a column that should not be checked, just add its name to the list. I also added a variable, max_lim, where you can set your limit. Hope this works; tell me if anything goes wrong!
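As a quick check without an Excel file, the same capping rule can be sketched against the sample rows from the question. This is only an illustration under the assumption that a column excluded on one row stays excluded on later rows, which is what the expected [excl] values suggest. The names `cap_columns` and `sample` are made up for this example.

```python
import numpy as np
import pandas as pd

nan = np.nan
# The ten sample rows from the question, value columns only.
sample = pd.DataFrame(
    {
        "A": [nan, -0.01, 0.02, -0.01, -0.02, nan, nan, nan, nan, nan],
        "B": [nan, nan, nan, 0.01, -0.04, 0.02, -0.02, nan, nan, nan],
        "C": [nan, nan, nan, nan, 0.02, 0.03, 0.01, -0.02, nan, nan],
        "D": [nan, nan, nan, nan, nan, 0.02, -0.01, 0.05, 0.02, nan],
        "E": [nan, nan, nan, nan, nan, 0.01, 0.03, -0.05, 0.02, nan],
        "F": [nan, nan, nan, nan, nan, nan, 0.01, 0.02, 0.05, 0.01],
    },
    index=pd.date_range("2020-01-01", periods=10),
)


def cap_columns(frame, max_lim=3):
    """Return per-row capped counts and lists of excluded headings."""
    excluded = set()  # headings excluded so far; they stay excluded
    lims, excls = [], []
    for mask in frame.notna().to_numpy():
        kept = 0
        row_excl = []
        for col, has_value in zip(frame.columns, mask):
            if not has_value:
                continue
            if col in excluded or kept >= max_lim:
                row_excl.append(col)
                excluded.add(col)
            else:
                kept += 1
        lims.append(kept)
        excls.append(row_excl)
    return lims, excls


lims, excls = cap_columns(sample)
sample["lim"] = lims
sample["excl"] = pd.Series(excls, index=sample.index)
```

With the sample rows, `lims` and `excls` should match the [lim] and [excl] columns shown in the question.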
