
Pandas sampling a dataframe but treating multiple rows as a single row based on column

Consider the following toy code, which illustrates a simplified version of my actual question:

import pandas

df = pandas.DataFrame(
    {
        'n_event':     [1,2,3,4,5],
        'some column': [0,1,2,3,4],
    }
)

df = df.set_index(['n_event'])
print(df)

resampled_df = df.sample(frac=1, replace=True)
print(resampled_df)

The resampled_df is, as its name suggests, a resampled version of the original one (sampled with replacement). This is exactly what I want. An example output of the previous code is

         some column
n_event             
1                  0
2                  1
3                  2
4                  3
5                  4
         some column
n_event             
4                  3
1                  0
4                  3
4                  3
2                  1

Now for my actual question I have the following dataframe:

import pandas

df = pandas.DataFrame(
    {
        'n_event':     [1,1,2,2,3,3,4,4,5,5],
        'n_channel':   [1,2,1,2,1,2,1,2,1,2],
        'some column': [0,1,2,3,4,5,6,7,8,9],
    }
)

df = df.set_index(['n_event','n_channel'])
print(df)

which looks like

                   some column
n_event n_channel             
1       1                    0
        2                    1
2       1                    2
        2                    3
3       1                    4
        2                    5
4       1                    6
        2                    7
5       1                    8
        2                    9

I want to do exactly the same as before, i.e. resample with replacement, but treating each group of rows with the same n_event as a single entity. A hand-built example of what I want could look like this:

                   some column
n_event n_channel             
2       1                    2
        2                    3
2       1                    2
        2                    3
3       1                    4
        2                    5
1       1                    0
        2                    1
5       1                    8
        2                    9

As seen, each n_event was treated as a whole and the rows within each event were not mixed up.

How can I do this without proceeding by brute force (i.e. without for loops, etc.)?

I have tried df.sample(frac=1, replace=True, ignore_index=False) and a few things using groupby, without success.

Would a pivot() / melt() sequence work for you?

Use pivot() to go from long to wide (make each group a single row).
Do the sampling.
Then go back from wide to long using melt().

I don't have time to work out a full answer, but I thought I would pass this idea along in case it helps.
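For reference, here is a minimal sketch of those three steps on the single-column frame from the question (my own untested sketch; the variable names wide and resampled are just illustrative):

import pandas

df = pandas.DataFrame(
    {
        'n_event':     [1,1,2,2,3,3,4,4,5,5],
        'n_channel':   [1,2,1,2,1,2,1,2,1,2],
        'some column': [0,1,2,3,4,5,6,7,8,9],
    }
).set_index(['n_event','n_channel'])

# 1. Long -> wide: one row per event, one column per channel.
wide = df.reset_index().pivot(index='n_event', columns='n_channel', values='some column')

# 2. Sample whole rows, i.e. whole events, with replacement.
wide = wide.sample(frac=1, replace=True)

# 3. Wide -> long again.
resampled = wide.reset_index().melt(
    id_vars='n_event', var_name='n_channel', value_name='some column',
)
print(resampled)

One caveat: melt groups the result by n_channel rather than keeping the two channels of each sampled event adjacent; stack, as used in the answer below, keeps the (n_event, n_channel) pairing together in the index instead.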

Following the suggestion of jch, I was able to find a solution by combining pivot and stack:

import pandas

df = pandas.DataFrame(
    {
        'n_event':     [1,1,2,2,3,3,4,4,5,5],
        'n_channel':   [1,2,1,2,1,2,1,2,1,2],
        'some column': [0,1,2,3,4,5,6,7,8,9],
        'other col':   [5,6,4,3,2,5,2,6,8,7],
    }
)

# Long -> wide: one row per n_event, one column per (value column, n_channel) pair.
resampled_df = df.pivot(
    index = 'n_event',
    columns = 'n_channel',
    values = [c for c in df.columns if c not in ('n_event', 'n_channel')],  # all remaining columns, as a list
)
# Sample whole rows, i.e. whole events, with replacement.
resampled_df = resampled_df.sample(frac=1, replace=True)
# Wide -> long: move n_channel back into the index next to n_event.
resampled_df = resampled_df.stack()
print(resampled_df)
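Here pivot makes each n_event a single wide row (one column per combination of value column and n_channel), sample then draws whole events with replacement, and stack moves n_channel back into the index, restoring the two-level (n_event, n_channel) layout with repeated n_event values for events drawn more than once.

If the frame already has the (n_event, n_channel) MultiIndex set, as in the question, the same pipeline should work after a reset_index(); a sketch (untested beyond this toy case):

import pandas

df = pandas.DataFrame(
    {
        'n_event':     [1,1,2,2,3,3,4,4,5,5],
        'n_channel':   [1,2,1,2,1,2,1,2,1,2],
        'some column': [0,1,2,3,4,5,6,7,8,9],
        'other col':   [5,6,4,3,2,5,2,6,8,7],
    }
).set_index(['n_event','n_channel'])

resampled_df = (
    df.reset_index()
      .pivot(
          index='n_event',
          columns='n_channel',
          values=list(df.columns),  # only the value columns, since the index was reset
      )
      .sample(frac=1, replace=True)
      .stack()
)
print(resampled_df)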
