I am using apply() to construct a Series of tuples from the values of an existing DataFrame. I need a specific order of the values in each tuple, and to replace NaN with '{}' in all but one column.
The following functions produce the desired result, but execution is rather slow:
def build_insert_tuples_series(row):
    # Order the final tuple: v2 first, then the remaining columns.
    # Replace NaN with "{}" for all but the v2 column.
    vals = [row['v2']]
    row_sans_v2 = row.drop(labels=['v2'])
    row_sans_v2.fillna("{}", inplace=True)
    res = [val for val in row_sans_v2]
    vals += res
    return tuple(vals)
def generate_insert_values_series(df):
    df['insert_vals'] = df.apply(lambda x: build_insert_tuples_series(x), axis=1)
    return df['insert_vals']
Original DataFrame:
id v1 v2
0 1.0 foo quux
1 2.0 bar foo
2 NaN NaN baz
Resulting DataFrame upon calling generate_insert_values_series(df), where the order of the final tuple is (v2, ..all_other_columns..):
id v1 v2 insert_vals
0 1.0 foo quux (quux, 1.0, foo)
1 2.0 bar foo (foo, 2.0, bar)
2 NaN NaN baz (baz, {}, {})
Timing the function to generate the resulting DataFrame:
%%timeit
generate_insert_values_series(df)
100 loops, best of 3: 2.69 ms per loop
I feel that there may be a way to more efficiently construct the Series, but am unsure of how to optimize the operation using vectorization, or another approach.
zip, get, mask, fillna, and sorted
A one-liner, for what it's worth:
df.assign(
insert_vals=
[*zip(*map(df.mask(df.isna(), {}).get, sorted(df, key=lambda x: x != 'v2')))])
id v1 v2 insert_vals
0 1.0 foo quux (quux, 1.0, foo)
1 2.0 bar foo (foo, 2.0, bar)
2 NaN NaN baz (baz, {}, {})
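The sorted key works because booleans compare as integers: 'v2' maps to False (0) and every other column to True (1), and Python's sort is stable, so 'v2' moves to the front while the remaining columns keep their original order. A small sketch of just that step:

```python
import pandas as pd

df = pd.DataFrame({'id': [1.0], 'v1': ['foo'], 'v2': ['quux']})

# 'v2' gets key False (0), every other column True (1);
# the stable sort puts 'v2' first and preserves the rest.
cols = sorted(df, key=lambda x: x != 'v2')
print(cols)  # ['v2', 'id', 'v1']
```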
Less one-liner-ish
get = df.mask(df.isna(), {}).get
key = lambda x: x != 'v2'
cols = sorted(df, key=key)
df.assign(insert_vals=[*zip(*map(get, cols))])
id v1 v2 insert_vals
0 1.0 foo quux (quux, 1.0, foo)
1 2.0 bar foo (foo, 2.0, bar)
2 NaN NaN baz (baz, {}, {})
This should work for legacy Python (2.7), where zip already returns a list:
get = df.mask(df.isna(), {}).get
key = lambda x: x != 'v2'
cols = sorted(df, key=key)
df.assign(insert_vals=zip(*map(get, cols)))
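On Python 3, by contrast, zip returns an iterator, so it has to be materialized before being assigned as a column. A sketch of the same idea (using fillna('{}'), as in the question, in place of the mask/get step):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1.0, 2.0, np.nan],
                   'v1': ['foo', 'bar', np.nan],
                   'v2': ['quux', 'foo', 'baz']})

get = df.fillna('{}').get
cols = sorted(df, key=lambda x: x != 'v2')

# list(...) materializes the zip iterator so pandas can take its length
df = df.assign(insert_vals=list(zip(*map(get, cols))))
print(df['insert_vals'].iloc[2])  # ('baz', '{}', '{}')
```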
First, you can use numpy to replace the null values with the '{}' placeholder:
import pandas as pd
import numpy as np
temp = pd.DataFrame({'id':[1,2, None], 'v1':['foo', 'bar', None], 'v2':['quux', 'foo', 'bar']})
def replace_na(col):
    return np.where(temp[col].isnull(), '{}', temp[col])

def generate_tuple(df):
    df['id'], df['v1'] = replace_na('id'), replace_na('v1')
    return df.apply(lambda x: tuple([x['v2'], x['id'], x['v1']]), axis=1)
The timing is now:
%%timeit
temp['insert_tuple'] = generate_tuple(temp)
1000 loops, best of 3: 1 ms per loop
If you change generate_tuple to return the zipped columns directly:
def generate_tuple(df):
    df['id'], df['v1'] = replace_na('id'), replace_na('v1')
    return list(zip(df['v2'], df['id'], df['v1']))
your gain becomes:
%%timeit
temp['insert_tuple'] = generate_tuple(temp)
1000 loops, best of 3 : 674 µs per loop
You shouldn't want to do this, since the new series loses all vectorised functionality. But if you must, you can avoid apply here by using pd.DataFrame.itertuples, a list comprehension, or map. The only complication is reordering the columns, which you can do via conversion to a list:
df = pd.concat([df]*10000, ignore_index=True)
col_lst = df.columns.tolist()
cols = [col_lst.pop(col_lst.index('v2'))] + col_lst
%timeit list(df[cols].itertuples(index=False)) # 31.3 ms per loop
%timeit [tuple(x) for x in df[cols].values] # 74 ms per loop
%timeit list(map(tuple, df[cols].values)) # 73 ms per loop
The benchmarking above is on Python 3.6.0, but you are likely to find these more efficient than apply even on 2.7. Note that the list conversion is not necessary for the final version, since map already returns a list in 2.7.
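One caveat worth noting: itertuples returns namedtuples by default, not plain tuples. If plain tuples are required, recent pandas versions accept name=None to yield them directly — a small sketch:

```python
import pandas as pd

df = pd.DataFrame({'id': [1.0], 'v1': ['foo'], 'v2': ['quux']})
cols = ['v2', 'id', 'v1']

# name=None makes itertuples yield plain tuples instead of namedtuples
tuples = list(df[cols].itertuples(index=False, name=None))
print(tuples)  # [('quux', 1.0, 'foo')]
```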
If absolutely necessary, you can fillna via a series (note the result must be assigned back per column):
s = pd.Series([{} for _ in range(len(df.index))], index=df.index)
for col in cols:
    df[col] = df[col].fillna(s)
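For completeness, a minimal sketch of the series-based fillna on the example frame, assuming the goal is literal empty dicts rather than the string '{}':

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1.0, np.nan],
                   'v1': ['foo', np.nan],
                   'v2': ['quux', 'baz']})

# A Series of empty dicts is needed: passing a plain dict to fillna
# would be interpreted as a per-column value mapping, not a fill value.
s = pd.Series([{} for _ in range(len(df.index))], index=df.index)
for col in ['id', 'v1']:
    df[col] = df[col].fillna(s)
```

After this, df['id'] and df['v1'] hold {} where the NaNs were, at the cost of object dtype in those columns.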