
How can you find the most common sets using python?

I have a pandas dataframe where one column is a list of all courses taken by a student. The index is the student's ID.

I'd like to find the most common set of courses across all students. For instance, if the dataframe looks like this:

ID    |     Courses
1           [A, C]
2           [A, C]
3           [A, C] 
4           [B, C]
5           [B, C]
6           [K, D] 
...

Then I'd like the output to return the most common sets and their frequency, something like:

{[A,C]: 3, [B,C]: 2}

You can first convert each list to a tuple with apply(tuple), then count the tuples with value_counts, and finally turn the top results into a dictionary with to_dict:

print (df.Courses.apply(tuple).value_counts()[:2].to_dict())
{('A', 'C'): 3, ('B', 'C'): 2}
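
The question asks about sets, and tuples are order-sensitive, so ('A', 'C') and ('C', 'A') would be counted separately. If the course lists are not guaranteed to be in a consistent order, one possible variation is to sort each list before converting it. The sketch below is only an illustration under that assumption; the data (with student 3 listing the courses in reverse order) and the names `normalized` and `top_two` are made up for the example.

```python
import pandas as pd

# Hypothetical data: student 3 lists the same courses in a different order.
df = pd.DataFrame(
    {"Courses": [["A", "C"], ["A", "C"], ["C", "A"],
                 ["B", "C"], ["B", "C"], ["K", "D"]]},
    index=[1, 2, 3, 4, 5, 6],
)

# Sort each list before converting it to a tuple so that order does not matter.
normalized = df["Courses"].apply(lambda courses: tuple(sorted(courses)))

# Count the normalized tuples and keep the two most frequent ones.
top_two = normalized.value_counts().head(2).to_dict()
print(top_two)
```

Here both orderings of A and C end up under the same key, so that combination is counted three times. Using frozenset instead of a sorted tuple would also work, but it would drop duplicate courses within a list.
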

import pandas as pd

# create example data
a = range(6)
b = [['A', 'C'], ['A', 'C'], ['A', 'C'], ['B', 'C'], ['B', 'C'], ['K', 'D']]
df = pd.DataFrame({'ID': a, 'Courses': b})

# convert the lists in the Courses column to tuples
# (lists are unhashable, so value_counts cannot count them directly)
df['Courses'] = df['Courses'].apply(tuple)
print(df.Courses.value_counts())

Output:

(A, C)    3
(B, C)    2
(K, D)    1
Name: Courses, dtype: int64
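
The reason for the tuple conversion is that counting relies on hashing, and Python lists are mutable and therefore unhashable, while tuples of hashable items are hashable. A minimal check in plain Python, with variable names chosen only for this illustration:

```python
courses = ["A", "C"]

# Hashing a list raises TypeError because lists are mutable.
try:
    hash(courses)
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# A tuple built from the same items can be hashed.
tuple_is_hashable = isinstance(hash(tuple(courses)), int)

print(list_is_hashable, tuple_is_hashable)
```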

Edit (as my answer was accepted):

jezrael describes (first as a comment to my answer) a much more compact version of the same approach:

a = range(6)
b = [['A', 'C'], ['A', 'C'], ['A', 'C'], ['B', 'C'], ['B', 'C'], ['K', 'D']]
df = pd.DataFrame({'ID': a, 'Courses': b})

print(df.Courses.apply(tuple).value_counts())  # list->tuple and counting in one line!
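
As an alternative that is not part of either answer, the same counting can be sketched with collections.Counter from the standard library, which returns the top entries as a list of (tuple, count) pairs instead of a Series. The names `counts` and `top_two` are chosen for this example only.

```python
from collections import Counter

import pandas as pd

# Same example data as above.
df = pd.DataFrame({
    "ID": range(6),
    "Courses": [["A", "C"], ["A", "C"], ["A", "C"],
                ["B", "C"], ["B", "C"], ["K", "D"]],
})

# Lists are unhashable, so convert each one to a tuple before counting.
counts = Counter(map(tuple, df["Courses"]))

# most_common(2) returns the two most frequent (tuple, count) pairs.
top_two = counts.most_common(2)
print(top_two)
```

Passing the result to dict() gives a dictionary in the shape the question asks for, with tuples as keys.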
