Python pandas dataframe transformations with groupby, pivot and transpose

I have a dataframe with two columns: date and bill_id. The dates in the date column span one year, from 01-01-2017 to 30-12-2017. There are 1000 unique bill_ids, and each bill_id occurs at least once in the bill_id column. The result is a DataFrame of 2 columns and 1,000,000 rows...

dt         | bill_id

01-01-2017   bill_1
01-01-2017   bill_2
02-01-2017   bill_1
02-01-2017   bill_3
03-01-2017   bill_4
03-01-2017   bill_4
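
For reference, the sample above can be constructed as:

import pandas as pd

# Sample data taken from the table above
df = pd.DataFrame({
    'dt': ['01-01-2017', '01-01-2017', '02-01-2017',
           '02-01-2017', '03-01-2017', '03-01-2017'],
    'bill_id': ['bill_1', 'bill_2', 'bill_1', 'bill_3', 'bill_4', 'bill_4'],
})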

So some bill_ids may occur on a specific day while others do not.

What I want to achieve is a dataframe in which all unique bill_ids are columns, all unique dates are rows, and each cell holds 0, 1 or 2 for the corresponding day, where 0 = did not appear on that date yet, 1 = appeared on that date, and 2 = did not appear on that date but appeared before. E.g.:

if a bill_id existed on 02-01-2017, it would have 0 on 01-01-2017, 1 on 02-01-2017, and 2 on 03-01-2017 and on all consecutive days.
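
For the sample above, the desired output would therefore be:

    bill_id     bill_1  bill_2  bill_3  bill_4
dt
01-01-2017       1       1       0       0
02-01-2017       1       2       1       0
03-01-2017       2       2       2       1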

I did it in a few steps, but the code does not scale as it is slow:

def map_values(row, df_z, c):
    # Rows of df_z for the date of the current output row
    subs = df_z[[c, 'bill_id', 'date']].loc[df_z['date'] == row['dt']]
    # Note: `in` on a Series checks the index, so test .values instead
    if c not in subs['bill_id'].values:
        row[c] = max(subs[c].tolist())
    else:
        val = df_z[c].loc[(df_z['date'] == row['dt']) & (df_z['bill_id'] == c)].values
        assert len(val) == 1
        row[c] = val[0]
    return row


def map_to_one(x):
    # For each bill present in this date group, mark its own column with 1
    bills_x = x['bill_id'].tolist()

    for b in bills_x:
        try:
            # .loc[row_mask, col] avoids chained assignment, which may not write back
            x.loc[x['bill_id'] == b, b] = 1
        except KeyError:
            pass
    return x


def replace_val(df_groupped, col):
    # Index positions of the rows where this bill is marked with 1
    mask = df_groupped.loc[df_groupped['bill_id'] == col].index[df_groupped[col].loc[df_groupped['bill_id'] == col] == 1]

    # First and last date on which the bill appears
    min_dt = df_groupped.iloc[min(mask)]['date']
    max_dt = df_groupped.iloc[max(mask)]['date']

    # 0 before the first appearance, 1 in between, 2 after the last;
    # .loc[row_mask, col] avoids chained assignment
    df_groupped.loc[df_groupped['date'] < min_dt, col] = 0
    df_groupped.loc[(df_groupped['date'] >= min_dt) & (df_groupped['date'] <= max_dt), col] = 1
    df_groupped.loc[df_groupped['date'] > max_dt, col] = 2
    return df_groupped


def reduce_cols(row):
    # Copy the value from the row's own bill_id column into a single 'val' column
    col_id = row['bill_id']
    row['val'] = row[col_id]
    return row


import pandas as pd

df = df.sort_values(by='date')
df = df[pd.notnull(df['bill_id'])]
bills = list(set(df['bill_id'].tolist()))

# Initialise one column per bill with a placeholder value
for col in bills:
    df[col] = 9

df_groupped = df.groupby('date')
df_groupped = df_groupped.apply(lambda x: map_to_one(x))
df_groupped = df_groupped.reset_index()
df_groupped.to_csv('groupped_in.csv', index=False)
df_groupped = pd.read_csv('groupped_in.csv')

for col in bills:
    df_groupped = replace_val(df_groupped, col)

df_groupped = df_groupped.apply(lambda row: reduce_cols(row), axis=1)
df_groupped.to_csv('out.csv', index=False)

cols = [x for x in df_groupped.columns if x not in ['index', 'date', 'bill_id', 'val']]
col_dt = sorted(list(set(df_groupped['date'].tolist())))
dd = {x:[0]*len(col_dt) for x in cols}
dd['dt'] = col_dt
df_mapped = pd.DataFrame(data=dd).set_index('dt').reset_index()

for c in cols:
    df_mapped = df_mapped.apply(lambda row: map_values(row, df_groupped[[c, 'bill_id', 'date']], c), axis=1)

EDIT:

The answer from Joe is fine, but I decided to go with another option instead:

  1. get date.min() and date.max()
  2. group the dataframe by bill_id
  3. apply a function in which I check date_x.min() and date_x.max() per group and compare them with the global date.min() and date.max(); that way I know where the 0s, 1s and 2s go :) (see the sketch below)
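
A minimal sketch of that idea, assuming string dates in DD-MM-YYYY format and the column names from the sample above (the helper name fill_status is made up for illustration):

import pandas as pd

# Parse the dates so min()/max() compare chronologically, not lexically
df['dt'] = pd.to_datetime(df['dt'], format='%d-%m-%Y')
all_dates = pd.DatetimeIndex(sorted(df['dt'].unique()))

def fill_status(dates, all_dates):
    # 0 before the bill's first appearance, 1 while it is "alive",
    # 2 on every day after its last appearance
    status = pd.Series(1, index=all_dates)
    status[all_dates < dates.min()] = 0
    status[all_dates > dates.max()] = 2
    return status

# One row per date, one column per bill_id
result = df.groupby('bill_id')['dt'].apply(fill_status, all_dates=all_dates).unstack().T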

I hope I understood your desired output correctly.

First, make a crosstab:

df1 = pd.crosstab(df['dt'], df['bill_id'])

Output:

    bill_id     bill_1  bill_2  bill_3  bill_4
dt                                        
01-01-2017       1       1       0       0
02-01-2017       1       0       1       0
03-01-2017       0       0       0       2

From here you start to modify the df. First, create a copy that you will use as a mask:

df2 = df1.copy()

Replace the 0s that come after a 1 (or after other values > 1):

for col in df2.columns:
    df2[col] = df2[col].replace(to_replace=0, method='ffill')

    bill_id     bill_1  bill_2  bill_3  bill_4
dt                                        
01-01-2017       1       1       0       0
02-01-2017       1       1       1       0
03-01-2017       1       1       1       2
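
Note: newer pandas versions deprecate the method= argument of replace, so the loop above may warn there; an equivalent, assuming the integer counts produced by crosstab, is:

for col in df2.columns:
    # Mask the zeros, forward-fill from the previous non-zero value,
    # and restore the leading zeros that had nothing to fill from
    df2[col] = df2[col].mask(df2[col].eq(0)).ffill().fillna(0).astype(int)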

Now subtract the two dataframes:

df3 = df1-df2

These are the changed values:

    bill_id     bill_1  bill_2  bill_3  bill_4
dt                                        
01-01-2017       0       0       0       0
02-01-2017       0      -1       0       0
03-01-2017      -1      -1      -1       0

Replace these values with 2:

for col in df3.columns:
    df3[col] = df3[col].replace(-1, 2)

Go back to the first df1 and change the values greater than 1 to 1:

for col in df1.columns:
    df1[col] = df1[col].apply(lambda x: x if x < 2 else 1)

In the end, sum this last df with df3:

df_add = df1.add(df3, fill_value=0)

Output:

    bill_id     bill_1  bill_2  bill_3  bill_4
dt                                        
01-01-2017       1       1       0       0
02-01-2017       1       2       1       0
03-01-2017       2       2       2       1

To complete, replace any remaining negative values with 2:

for col in df_add.columns:
    df_add[col] = df_add[col].apply(lambda x: 2 if x < 0 else x)
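
As a side note, the per-column loops in this answer can be vectorized over the whole dataframe; a sketch of equivalent one-liners:

df2 = df2.mask(df2.eq(0)).ffill().fillna(0).astype(int)  # forward-fill the zeros
df3 = df3.replace(-1, 2)                                 # mark "existed before" days
df1 = df1.clip(upper=1)                                  # cap counts at 1
df_add = df_add.mask(df_add < 0, 2)                      # replace negative values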
