
pandas get unique values from column of lists

How do I get the unique values of a column of lists in pandas or numpy, such that the second column from


would result in 'action', 'crime', 'drama'?

The closest (but non-functional) solutions I could come up with were:

genres = data['Genre'].unique()

This predictably raises a TypeError, because lists aren't hashable:

TypeError: unhashable type: 'list'

A set seemed like a good idea, but

genres = data.apply(set(), columns=['Genre'], axis=1)

also raises a TypeError: set() takes no keyword arguments

You can use explode:

import pandas as pd
data = pd.DataFrame([
    {
        "title": "The Godfather: Part II",
        "genres": ["crime", "drama"],
        "director": "Francis Ford Coppola"
    },
    {
        "title": "The Dark Knight",
        "genres": ["action", "crime", "drama"],
        "director": "Christopher Nolan"
    }
])
# Changed from data.explode("genres")["genres"].unique() as suggested by rafaelc
data["genres"].explode().unique()

Results in:

array(['crime', 'drama', 'action'], dtype=object)
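
One thing to watch with explode: a row holding an empty list comes out as NaN, so NaN can end up among the unique values. Here is a minimal sketch that drops those entries first. It assumes a hypothetical third film with no genres listed, which is not part of the original data.

```python
import pandas as pd

# Hypothetical data: the third row has an empty genre list.
data = pd.DataFrame({
    "title": ["The Godfather: Part II", "The Dark Knight", "Untitled"],
    "genres": [["crime", "drama"], ["action", "crime", "drama"], []],
})

# explode() turns the empty list into NaN, so drop it before deduplicating.
genres = data["genres"].explode().dropna().unique()
print(sorted(genres))
```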

If you only want to find the unique values, I'd recommend using itertools.chain.from_iterable to concatenate all those lists:

import itertools
import numpy as np

>>> np.unique([*itertools.chain.from_iterable(df.Genre)])
array(['action', 'crime', 'drama'], dtype='<U6')

Or, even faster:

>>> set(itertools.chain.from_iterable(df.Genre))
{'action', 'crime', 'drama'}


df = pd.DataFrame({'Genre':[['crime','drama'],['action','crime','drama']]})
df = pd.concat([df]*10000)

%timeit set(itertools.chain.from_iterable(df.Genre))
100 loops, best of 3: 2.55 ms per loop

%timeit set([x for y in df['Genre'] for x in y])
100 loops, best of 3: 4.09 ms per loop

%timeit np.unique([*itertools.chain.from_iterable(df.Genre)])
100 loops, best of 3: 12.8 ms per loop

%timeit np.unique(df['Genre'].sum())
1 loop, best of 3: 1.65 s per loop

%timeit set(df['Genre'].sum())
1 loop, best of 3: 1.66 s per loop
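
The %timeit lines above need IPython. Below is a rough sketch for rerunning the comparison as a plain script with the standard timeit module. The labels in the dictionary and the repeat count are my own choices, and the slow sum() variants are left out to keep the run short. Absolute numbers will differ by machine and library version.

```python
import itertools
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"Genre": [["crime", "drama"], ["action", "crime", "drama"]]})
df = pd.concat([df] * 10000)

# Candidate implementations, keyed by a label chosen for this sketch.
candidates = {
    "set + chain": lambda: set(itertools.chain.from_iterable(df.Genre)),
    "set + comprehension": lambda: {x for y in df["Genre"] for x in y},
    "np.unique + chain": lambda: np.unique([*itertools.chain.from_iterable(df.Genre)]),
}

results = {}
for label, func in candidates.items():
    results[label] = set(func())  # keep the values to check that all approaches agree
    seconds = timeit.timeit(func, number=10) / 10
    print(f"{label}: {seconds * 1e3:.2f} ms per loop")
```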

Here are some options:

# toy data
df = pd.DataFrame({'Genre':[['crime','drama'],['action','crime','drama']]})

# 109 µs ± 2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# 87 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

set([x for y in df['Genre'] for x in y])
# 11.8 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

If you're just looking to extract the information and not add it back to the DataFrame, you can build up a Python set with its update method in a for loop:

import pandas as pd
df = pd.DataFrame({'movie':[[1,2,3],[1,2,6]]})
out = set()
for row in df['movie']:
    # update() adds every item of the row and ignores duplicates
    out.update(row)

You could also wrap this in an apply call if you wanted (the apply call itself returns a Series of None values, but the set is updated in place):

out = set()
df['movie'].apply(lambda x: out.update({item for item in x}))

Personally I think the for loop is a bit clearer to read.

Not sure if it's exactly what you wanted, but this will allow you to convert it into a set.

import pandas as pd
import numpy as np

df = pd.DataFrame({'Movie':['The Godfather', 'Dark Knight'], 'Genre': [['Crime', 'Drama'],['Crime', 'Drama', 'Action']]})

genres = []
for sublist in df['Genre']:
    for item in sublist:
        genres.append(item)

genre_set = set(genres)
print(genre_set)

Output: {'Action', 'Drama', 'Crime'}

Use sets to chain the uniqueness checks. I've used this technique on huge lists in big-data environments. The main advantage is that it cuts down the time needed to produce a final flat list.

  1. Convert the list-column into sets
  2. Reduce all sets into a final set, using union


from functools import reduce # for python 3

l = df.Genre.dropna().tolist()
sets = [ set(i) for i in l ]
final_set = reduce(lambda x, y: x.union(y), sets)

  • In big-data environments such as Spark, use map to convert each list into a set, then reduce as above.
  • Change union to intersection if you need the values common to all lists.
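
To make the union/intersection swap concrete, here is a small sketch on a hypothetical frame shaped like the Genre column in the question, with one missing row added to show why dropna() is there. It passes the set methods to reduce directly instead of through a lambda, which is equivalent.

```python
from functools import reduce

import pandas as pd

# Hypothetical frame mirroring the question's Genre column, plus one missing row.
df = pd.DataFrame({"Genre": [["crime", "drama"], ["action", "crime", "drama"], None]})

# Step 1: one set per non-missing row.
sets = [set(genres) for genres in df["Genre"].dropna()]

# Step 2: fold the sets together.
all_genres = reduce(set.union, sets)            # every value seen in any row
common_genres = reduce(set.intersection, sets)  # values present in every row

print(all_genres)
print(common_genres)
```

If the column could be entirely empty, reduce raises a TypeError on the empty list. For the union case, passing set() as the third argument to reduce avoids that.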

