
Faster way to count total occurrences of values in a column of lists in pandas?

I have a column in a pandas dataframe which contains a large number of lists of labels:

>>> data['SPLIT'].head(10)
0    [33.23, 88.72, 38.93, E931.7, V09.0, 041.11, 5...
1    [99.04, 38.06, 39.57, 00.91, 55.69, V15.82, 27...
2    [96.04, 96.72, 401.9, 276.5, 584.9, 428.0, 507...
3    [96.6, 99.15, 99.83, V29.0, 765.15, 765.25, 77...
4    [96.71, 96.04, 54.12, 99.60, 38.93, 99.15, 53....
5    [88.72, 37.61, 39.61, 36.15, 36.12, 272.0, 401...
6    [38.93, 88.72, 37.31, 272.4, 719.46, 722.0, 31...
7    [88.72, 39.61, 35.71, 272.4, V12.59, 458.29, 7...
8    [97.44, 99.04, 88.56, 37.23, 39.95, 38.95, 00....
9    [00.14, 89.61, 39.95, E878.8, 244.9, 443.9, 18...

What I am trying to do is iterate through all of these lists to count the total occurrences of each value, so that I can find the 50 most frequently occurring values.

This is the code I have used, which runs extremely slowly:

test = pd.Series(sum([item for item in data.SPLIT], [])).value_counts()

I also tried writing a separate function that loops through the lists and counts the values, but that was very slow as well.

Is there any way I can restructure this data, or use a pandas function, to get performance similar to something like df.groupby.count()?

I searched Google and Stack Overflow for half an hour, but none of the answers performed any better. I have been trying for a long time to find a faster way to flatten the lists or map the counts (the data has 500k rows, and the length of each list varies, from as short as 2 to as long as 512).
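The bottleneck is the sum(..., []) flattening: each addition builds a brand-new list containing everything accumulated so far, so the total work grows quadratically with the number of labels. Roughly, it behaves like this sketch:

flat = []
for item in data.SPLIT:
    flat = flat + item   # copies the whole accumulated list on every iteration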

Use a list comprehension for flattening instead of sum:

test = pd.Series([x for item in data.SPLIT for x in item]).value_counts()

Or flatten with chain.from_iterable:

from itertools import chain

test = pd.Series(list(chain.from_iterable(data.SPLIT))).value_counts()

Or also use collections.Counter:

from itertools import chain
from collections import Counter

test = pd.Series(Counter(chain.from_iterable(data.SPLIT)))
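Note that, unlike value_counts, a Series built from a Counter is not ordered by count; if you only need the 50 most common labels, you can either sort it afterwards or let Counter do it (a small sketch, assuming the same data.SPLIT column):

import pandas as pd
from itertools import chain
from collections import Counter

counter = Counter(chain.from_iterable(data.SPLIT))
top50 = pd.Series(dict(counter.most_common(50)))  # the 50 most frequent labels and their counts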

Or:

import functools, operator

test = pd.Series(functools.reduce(operator.iconcat, data.SPLIT, [])).value_counts()
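operator.iconcat extends its first argument in place (like +=), so the reduce grows a single list instead of re-copying it on every step; it is roughly equivalent to:

flat = []
for item in data.SPLIT:
    flat += item   # in-place extend, no copy of the accumulated list

test = pd.Series(flat).value_counts()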

Pure pandas solution:

test = pd.DataFrame(data.SPLIT.values.tolist()).stack().value_counts()
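With any of the value_counts-based results above, the counts are already sorted in descending order, so the 50 most frequent labels are simply the first 50 rows:

top50 = test.head(50)   # or test.nlargest(50); index = label, values = counts

One caveat on the last variant (an estimate based on the sizes in the question): tolist() builds a wide DataFrame padded with NaN out to the longest list (up to 512 columns here) before stack() drops them again, which can be memory-heavy with 500k rows.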

How about this?

import pandas as pd

# expand each list into its own column (one column per list position)
split = data["SPLIT"].apply(pd.Series)
split = split.rename(columns=lambda x: 'val_' + str(x))
# melt back to long form and count occurrences of each value
split.melt(value_name="val").groupby(["val"]).size()
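groupby(...).size() returns the counts ordered by the group key rather than by frequency, so to get the 50 most common labels from this result you would still sort, for example:

counts = split.melt(value_name="val").groupby(["val"]).size()
top50 = counts.sort_values(ascending=False).head(50)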
