
Vectorized way for applying a function to a dataframe to create lists

I have seen a few questions like this one:

Vectorized alternative to iterrows, Faster alternative to iterrows, Pandas: Alternative to iterrow loops, for loop using iterrows in pandas, python: using .iterrows() to create columns, and Iterrows performance. But each of them seems to address a unique case rather than a generalized approach.

My question is also about .iterrows.

I am trying to pass the first and second column values of each row to a function and build a list from the results.

What I have:

I have a pandas DataFrame with two columns that look like this.

         I.D         Score
1         11          26
3         12          26
5         13          26
6         14          25

What I did:

Here, Points is a function I defined earlier.

my_points = [Points(int(row[0]),row[1]) for index, row in score.iterrows()]

What I am trying to do:

A faster, vectorized form of the above.

The question is not really about how to iterate through a DataFrame and return a list, but about how to apply a function to the column values of each row.

You can use pandas.DataFrame.apply with axis set to 1:

df.apply(func, axis=1)
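As a minimal sketch of what axis=1 does: the function is called once per row and receives that row as a Series, so the values are read by column label. The Points below is a hypothetical stand-in that just returns a tuple (the asker's real function is not shown), and row_to_point is an illustrative helper name.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    {"I.D": [11, 12, 13, 14], "Score": [26, 26, 26, 25]},
    index=[1, 3, 5, 6],
)

# Hypothetical stand-in for the asker's Points function.
def Points(a, b):
    return (a, b)

# With axis=1, apply passes each row to the function as a Series.
def row_to_point(row):
    return Points(int(row["I.D"]), row["Score"])

my_points = df.apply(row_to_point, axis=1).tolist()
print(my_points)
```

Note that apply with axis=1 still loops over the rows in Python, so it is mainly a tidier way to write the loop.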

To get a list, call .tolist() on the result. What the list contains depends on what your function returns, and the function must accept a whole row:

df.apply(Points, axis=1).tolist()

To apply the function to only some columns, select them first:

df[['I.D', 'Score']].apply(Points, axis=1)

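If the function takes multiple arguments, you can pass the columns to numpy.vectorize, which avoids building a Series for every row (it still loops in Python internally, so it is a convenience rather than true vectorization):

np.vectorize(Points)(df['I.D'], df['Score'])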
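A rough sketch of the numpy.vectorize call, assuming a hypothetical two-argument function combine that returns a single number per row (it is not the asker's Points, which is not shown). The wrapped function is called element by element on the two columns.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"I.D": [11, 12, 13, 14], "Score": [26, 26, 26, 25]})

# Hypothetical two-argument function returning one scalar per row.
def combine(id_, score):
    return id_ * 100 + score

# np.vectorize wraps the function so it can be called on whole columns;
# internally it still calls combine once per element.
result = np.vectorize(combine)(df["I.D"], df["Score"])
print(result.tolist())
```

If the wrapped function returns a tuple instead of a scalar, expect the result to be shaped differently from a plain list of tuples, so check the output before relying on it.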

Or use a lambda:

df.apply(lambda x: Points(int(x['I.D']), x['Score']), axis=1).tolist()

Try a list comprehension over the zipped columns:

score = pd.concat([score] * 1000, ignore_index=True)

def Points(a,b):
    return (a,b)

In [147]: %timeit [Points(int(a),b) for a, b in zip(score['I.D'],score['Score'])]
1.3 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [148]: %timeit [Points(int(row[0]),row[1]) for index, row in score.iterrows()]
259 ms ± 5.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [149]: %timeit [Points(int(row[0]),row[1]) for row in score.itertuples()]
3.64 ms ± 80.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
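To reproduce this kind of comparison outside IPython, here is a small sketch using the standard timeit module. It assumes the sample data from the question and the same placeholder Points; the helper names with_zip and with_iterrows are made up for illustration, and the numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Sample data from the question, repeated to get 4000 rows.
score = pd.DataFrame({"I.D": [11, 12, 13, 14], "Score": [26, 26, 26, 25]})
score = pd.concat([score] * 1000, ignore_index=True)

# Placeholder for the asker's function.
def Points(a, b):
    return (a, b)

def with_zip():
    # Iterate over the two columns in lockstep; no per-row Series is built.
    return [Points(int(a), b) for a, b in zip(score["I.D"], score["Score"])]

def with_iterrows():
    # Builds a Series for every row, which is what makes it slow.
    return [Points(int(row["I.D"]), row["Score"]) for _, row in score.iterrows()]

# Both approaches should produce the same list.
same = with_zip() == with_iterrows()
print("same result:", same)

# Total seconds for 3 runs of each approach.
print("zip:     ", timeit.timeit(with_zip, number=3))
print("iterrows:", timeit.timeit(with_iterrows, number=3))
```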

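Have you tried the .itertuples() method?

my_points = [Points(int(row[0]),row[1]) for row in score.itertuples(index=False)]

It is a faster way to iterate over a pandas DataFrame than .iterrows(). By default, .itertuples() puts the index in the first position of each tuple, so index=False is passed here to make row[0] and row[1] refer to I.D and Score.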
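A short sketch of how the index affects positional access with itertuples, assuming the sample data from the question and a placeholder Points that returns a tuple:

```python
import pandas as pd

# Sample data from the question, including its non-contiguous index.
score = pd.DataFrame(
    {"I.D": [11, 12, 13, 14], "Score": [26, 26, 26, 25]},
    index=[1, 3, 5, 6],
)

# Placeholder for the asker's function.
def Points(a, b):
    return (a, b)

# Default behaviour: position 0 is the index label, then the columns.
first = next(score.itertuples())
print(first[0], first[1], first[2])

# With index=False, position 0 is the first column (I.D).
my_points = [Points(int(row[0]), row[1]) for row in score.itertuples(index=False)]
print(my_points)
```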

I hope it helps.
