Get a sample of aggregated row values with pandas

I need a function that, given a data frame and a number num, constructs a data frame with num rows in which every row is built as follows:
- for columns with string values, sample a value from that column in the original table
- for columns with floats or ints, take the mean of the column

Here is my code

def rows_aggr(df, num):
    dataframe = None
    for i in range(0, num):
        row = None
        for cname in df.columns.values:
            column = df[cname]
            dfcol = Series.to_frame(column)

            if column.dtype != np.number:
                item = dfcol.sample(n=1)
            else:
                item = dfcol.mean(axis=1)

            if row is None:
                row = item
            else:
                row = pd.concat([row, item], axis=1)

        if dataframe is None:
            dataframe = row
        else:
            dataframe = pd.concat([dataframe, row], axis=0)

    return dataframe

For some reason the rows contain NaN values and the number of rows exceeds num, so this code does not work correctly. If you know a better way to accomplish what I need, I would be happy to hear it.

For example, given

df = pd.DataFrame({'col1':list('abcdef'),'col2':range(6)}) and num=3

we would get something like

c, 2.5
f, 2.5
b, 2.5

assuming c, f and b were the values randomly picked.

Thank you!

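One error is that the condition column.dtype != np.number does not reliably detect numeric columns, since np.number is an abstract base type rather than a concrete dtype. There is also a problem with index alignment: when you do pd.concat([row, item], axis=1), item carries the index label of whichever row was sampled, and that label differs from column to column. concat aligns on those labels, which adds extra rows filled with NaN to row. Here is another way to do it.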
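To make the two problems concrete, here is a small sketch using the question's example frame. It is only an illustration: pandas.api.types.is_numeric_dtype is one possible dtype check (the answer further down uses select_dtypes instead), and the names numeric_flags, left, right, misaligned and aligned are made up for this example.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({'col1': list('abcdef'), 'col2': range(6)})

# A dtype-aware check: True for int/float columns, False for string columns.
numeric_flags = {col: is_numeric_dtype(df[col]) for col in df.columns}

# Two one-row frames whose index labels differ (3 and 0),
# as happens when each column is sampled independently.
left = df[['col1']].iloc[[3]]
right = df[['col2']].iloc[[0]]

# concat with axis=1 aligns on index labels, so this yields two rows with NaN.
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first puts both values on the same row.
aligned = pd.concat([left.reset_index(drop=True),
                     right.reset_index(drop=True)], axis=1)

print(numeric_flags)
print(misaligned)
print(aligned)
```

The misaligned frame has one row per distinct index label, which is why the original function returned more than num rows.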

SETUP

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': list('abcdef'), 'col2': list('ijklmn'),
                   'col3': range(6), 'col4': np.arange(10, 16) / 1.5})
print(df)
  col1 col2  col3       col4
0    a    i     0   6.666667
1    b    j     1   7.333333
2    c    k     2   8.000000
3    d    l     3   8.666667
4    e    m     4   9.333333
5    f    n     5  10.000000

You can use select_dtypes to find the columns that are not numeric, then build the data frame with a dictionary comprehension:

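def rows_aggr(df, num):
    # Columns whose dtype is not numeric (e.g. strings) get sampled.
    non_numeric_cols = df.select_dtypes(exclude=[np.number]).columns
    # A scalar mean is broadcast to the length of the sampled arrays.
    return pd.DataFrame({col: df[col].sample(num).values
                              if col in non_numeric_cols
                              else df[col].mean()
                         for col in df.columns})

print(rows_aggr(df, 3))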
  col1 col2  col3      col4
0    d    i   2.5  8.333333
1    a    n   2.5  8.333333
2    c    j   2.5  8.333333
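Two edge cases are worth keeping in mind with this approach: sample(num) raises an error when num is larger than the number of rows, and a frame with only numeric columns leaves the dictionary with nothing but scalars, which pd.DataFrame cannot size without an index. A possible variant is sketched below. It is not part of the original answer: the name rows_aggr_safe and the seed parameter are hypothetical, and it uses NumPy's Generator.choice instead of Series.sample so that every column draws from the same seeded generator.

```python
import numpy as np
import pandas as pd

def rows_aggr_safe(df, num, seed=None):
    """Hypothetical variant: sample non-numeric columns, repeat the mean of numeric ones."""
    rng = np.random.default_rng(seed)
    numeric_cols = set(df.select_dtypes(include=[np.number]).columns)
    # Sample with replacement only when more rows are requested than exist.
    replace = num > len(df)
    data = {}
    for col in df.columns:
        if col in numeric_cols:
            # Repeat the column mean num times so every column has length num.
            data[col] = np.full(num, df[col].mean())
        else:
            data[col] = rng.choice(df[col].to_numpy(), size=num, replace=replace)
    return pd.DataFrame(data, columns=df.columns)

df = pd.DataFrame({'col1': list('abcdef'), 'col2': range(6)})
result = rows_aggr_safe(df, 10, seed=0)
print(result)
```

Sampling with replacement only when it is needed keeps the behaviour of the answer's version for small num, where the sampled values are all distinct rows.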


 