
Create series of tuples from pandas DataFrame efficiently

I am using apply() to construct a Series of tuples from the values of an existing DataFrame. The values in each tuple must appear in a specific order, and NaN must be replaced with '{}' in every column except one.

The following functions work to produce the desired result, but the execution is rather slow:

def build_insert_tuples_series(row):
    # Put v2 first, followed by all other columns in their original order.
    # NaN is replaced with "{}" in every column except v2.
    vals = [row['v2']]
    row_sans_v2 = row.drop(labels=['v2'])
    row_sans_v2 = row_sans_v2.fillna("{}")
    res = [val for val in row_sans_v2]
    vals += res
    return tuple(vals)

def generate_insert_values_series(df):
    df['insert_vals'] = df.apply(lambda x: build_insert_tuples_series(x), axis=1)
    return df['insert_vals']

Original DataFrame:

    id   v1    v2
0  1.0  foo  quux
1  2.0  bar   foo
2  NaN  NaN   baz

Resulting DataFrame upon calling generate_insert_values_series(df):

The logic for order on the final tuple is (v2, ..all_other_columns..)

    id   v1    v2       insert_vals
0  1.0  foo  quux  (quux, 1.0, foo)
1  2.0  bar   foo   (foo, 2.0, bar)
2  NaN  NaN   baz     (baz, {}, {})

Timing the function to generate the resulting DataFrame:

%%timeit
generate_insert_values_series(df)
100 loops, best of 3: 2.69 ms per loop

I feel that there may be a way to more efficiently construct the Series, but am unsure of how to optimize the operation using vectorization, or another approach.

zip, get, mask, fillna, and sorted

A one-liner, for what it's worth:

df.assign(
    insert_vals=
    [*zip(*map(df.mask(df.isna(), {}).get, sorted(df, key=lambda x: x != 'v2')))])

    id   v1    v2       insert_vals
0  1.0  foo  quux  (quux, 1.0, foo)
1  2.0  bar   foo   (foo, 2.0, bar)
2  NaN  NaN   baz     (baz, {}, {})

Less one-liner-ish

get = df.mask(df.isna(), {}).get
key = lambda x: x != 'v2'
cols = sorted(df, key=key)

df.assign(insert_vals=[*zip(*map(get, cols))])

    id   v1    v2       insert_vals
0  1.0  foo  quux  (quux, 1.0, foo)
1  2.0  bar   foo   (foo, 2.0, bar)
2  NaN  NaN   baz     (baz, {}, {})
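
To unpack what the snippets above rely on: the sort key `x != 'v2'` maps 'v2' to False and every other column name to True. False sorts before True and Python's sort is stable, so 'v2' moves to the front while the remaining columns keep their relative order. `zip(*...)` then transposes the column sequences into one tuple per row. Below is a minimal pure-Python sketch of those two steps, with plain lists standing in for the DataFrame columns (the `columns` dict is illustrative only and assumes the NaN values have already been replaced):

```python
# Plain lists stand in for the DataFrame columns (illustrative data only,
# with the NaN cells already replaced by the '{}' placeholder).
columns = {
    'id': [1.0, 2.0, '{}'],
    'v1': ['foo', 'bar', '{}'],
    'v2': ['quux', 'foo', 'baz'],
}

# False sorts before True and sorted() is stable, so 'v2' comes first
# and the remaining names keep their original relative order.
cols = sorted(columns, key=lambda name: name != 'v2')

# zip(*...) transposes the column lists into one tuple per row.
rows = list(zip(*map(columns.get, cols)))

print(cols)
print(rows)
```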

This should work for legacy Python, where zip already returns a list:

get = df.mask(df.isna(), {}).get
key = lambda x: x != 'v2'
cols = sorted(df, key=key)

df.assign(insert_vals=zip(*map(get, cols)))

First, you can use NumPy to replace null values with the string '{}':

import pandas as pd
import numpy as np

temp = pd.DataFrame({'id':[1,2, None], 'v1':['foo', 'bar', None], 'v2':['quux', 'foo', 'bar']})

def replace_na(col): 
    return np.where(temp[col].isnull(), '{}', temp[col])

def generate_tuple(df):
    df['id'], df['v1'] = replace_na('id'), replace_na('v1')
    return df.apply(lambda x: tuple([x['v2'], x['id'], x['v1']]), axis=1)

The timing improves to:

%%timeit
temp['insert_tuple'] = generate_tuple(temp)
1000 loops, best of 3: 1 ms per loop

If you change generate_tuple to return a list built with zip instead of calling apply:

def generate_tuple(df):
    df['id'], df['v1'] = replace_na('id'), replace_na('v1')
    return list(zip(df['v2'], df['id'], df['v1']))

your gain becomes:

%%timeit
temp['insert_tuple'] = generate_tuple(temp)
1000 loops, best of 3: 674 µs per loop
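
The functions above hard-code the column names and overwrite the input columns. As a rough sketch of the same zip-based idea that handles any number of columns and leaves the input untouched, something like the following could work. The names `insert_tuples`, `first`, and `fill` are hypothetical and not part of the original answer, and `where` is swapped in for `np.where` so the replacement can be done on several columns at once:

```python
import pandas as pd


def insert_tuples(df, first='v2', fill='{}'):
    """Return one tuple per row: `first` leads, NaN elsewhere becomes `fill`."""
    others = [c for c in df.columns if c != first]
    # astype(object) lets the string placeholder sit next to numeric values;
    # where() keeps the non-null cells and substitutes `fill` for the rest.
    filled = df[others].astype(object).where(df[others].notna(), fill)
    # zip over whole columns instead of applying a function row by row.
    tuples = list(zip(df[first], *(filled[c] for c in others)))
    return pd.Series(tuples, index=df.index)


df = pd.DataFrame({'id': [1, 2, None],
                   'v1': ['foo', 'bar', None],
                   'v2': ['quux', 'foo', 'baz']})
result = insert_tuples(df)
print(result)
```

Because the helper returns a new Series, the caller decides whether to attach it to the frame, for example with `df.assign(insert_vals=insert_tuples(df))`.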

You shouldn't want to do this, as your new series will lose all vectorised functionality.

But, if you must, you can avoid apply here by using either pd.DataFrame.itertuples, a list comprehension, or map. The only complication is reordering columns, which you can do via conversion to list:

df = pd.concat([df]*10000, ignore_index=True)

col_lst = df.columns.tolist()
cols = [col_lst.pop(col_lst.index('v2'))] + col_lst

%timeit list(df[cols].itertuples(index=False))  # 31.3 ms per loop
%timeit [tuple(x) for x in df[cols].values]     # 74 ms per loop
%timeit list(map(tuple, df[cols].values))       # 73 ms per loop

The benchmarks above were run on Python 3.6.0, but you are likely to find these approaches more efficient than apply even on 2.7. Note that the list conversion is not necessary for the final version on 2.7, since map returns a list there.
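
One detail worth knowing about the itertuples variant: with only `index=False` it yields namedtuples, whereas passing `name=None` yields plain tuples. A small sketch on a two-row frame (the data and the way `cols` is built here are illustrative, not taken from the benchmark above):

```python
import pandas as pd

df = pd.DataFrame({'id': [1.0, 2.0],
                   'v1': ['foo', 'bar'],
                   'v2': ['quux', 'foo']})

# Move 'v2' to the front, keeping the other columns in their original order.
cols = ['v2'] + [c for c in df.columns if c != 'v2']

named = list(df[cols].itertuples(index=False))             # namedtuples
plain = list(df[cols].itertuples(index=False, name=None))  # plain tuples

print(named[0])
print(plain[0])
```

Namedtuples compare equal to plain tuples with the same values, so the distinction mostly matters when the exact type is checked or serialized.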

If absolutely necessary, you can fillna via a series:

s = pd.Series([{} for _ in range(len(df.index))], index=df.index)

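for col in cols[1:]:  # skip 'v2' (cols[0]), which keeps its NaN values
    df[col] = df[col].fillna(s)  # fillna returns a new Series, so assign it back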
