
Sorting pandas groupby object

I am following a tutorial on analysing a movie dataset. A snippet of the features is shown below.

[image: enter image description here]

The tutorial then does the following:
Let's also sort these groups so the users that share the most movies in common with the input have higher priority. This provides a richer recommendation since we won't go through every single user.
The tutorial sorts the userId group object with key=lambda x: len(x[1]). From the output, it looks as if it is sorting by the userId value, but I might be wrong. I would appreciate an explanation.

Could you please explain:
1) What is x[1]? I tried x[0] and got an error message saying that an int has no len().

2) How does sorting by len(x[1]) sort the userId, or whatever it is sorting?

3) Why, after sorting, has userSubsetGroup become a list of nested tuples of the form (userId, groupby('userId').get_group())?

Code in Tutorial

userSubsetGroup = userSubset.groupby(['userId'])
userSubsetGroup = sorted(userSubsetGroup, key=lambda x: len(x[1]), reverse=True)

Output of sorting

[image: enter image description here]

I'll try:

  1. pd.DataFrame.groupby returns a DataFrameGroupBy object:

    Returns DataFrameGroupBy

    Returns a groupby object that contains information about the groups.

    Iterating over this object yields tuples, one for each group produced by .groupby.

    The first item, x[0], of such a tuple is the unique value from the grouping column that defines the group. You are grouping by the column userId, therefore the first element of the tuple is a userId. Each userId appears only once because the groupby pulls all of its rows together. Try

    userIDs = [x[0] for x in userSubsetGroup]

    and print the list. It will contain all userIds, each only once. Since your userIds are integers, applying len to them doesn't make sense, hence the error you saw.

    The second item, x[1], of such a tuple is a dataframe: the slice of the original dataframe that belongs to the first element of the tuple. If you look at the last picture you will see that all rows of the dataframe in a given tuple have the same userId. That is the grouping at work.

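  2. sorted iterates over the DataFrameGroupBy object, so the things being sorted are its (userId, dataframe) tuples. The sorting criterion is the length of the dataframe (its number of rows), i.e. of the second element of each tuple. You could replace the key with key=lambda x: x[1].shape[0] and get the same result, using the dataframe's shape attribute instead of the more generic len function. (Because of reverse=True the order is descending, i.e. the tuples with the longest dataframes come first.) The userId values themselves are never compared; they just travel along with their dataframes. Since sorted is stable, groups with the same number of rows keep the order in which groupby yields them, which by default is ascending userId. That can make the output look as if it were sorted by userId.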

  3. According to the documentation of sorted:

    Return a new sorted list from the items in iterable.

    The items you get when iterating over userSubsetGroup are the tuples described in 1., which is why the result is a list of (userId, dataframe) tuples.
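To make the three points above concrete, here is a minimal sketch. The dataframe user_subset and its values are made up for illustration (they are not the tutorial's data), and the grouping uses the column name as a plain string so that the keys are scalars; in recent pandas versions, grouping by a one-element list such as ['userId'] may yield one-element tuples as keys instead.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's userSubset; the values are invented.
user_subset = pd.DataFrame({
    'userId':  [4, 4, 12, 12, 12, 7],
    'movieId': [1, 2, 1, 2, 3, 1],
    'rating':  [3.5, 4.0, 5.0, 2.0, 4.5, 3.0],
})

grouped = user_subset.groupby('userId')

# Point 2: len(df) and df.shape[0] both count rows, so both keys give the same order.
by_len = sorted(grouped, key=lambda x: len(x[1]), reverse=True)
by_shape = sorted(grouped, key=lambda x: x[1].shape[0], reverse=True)

order_by_len = [int(key) for key, _ in by_len]
order_by_shape = [int(key) for key, _ in by_shape]
sizes = [len(df) for _, df in by_len]
print('order:', order_by_len, 'rows per group:', sizes)

# Point 3: sorted() returns a plain list of (key, DataFrame) tuples;
# the groupby object itself is left unchanged.
print(type(by_len), type(by_len[0]), type(grouped))

# Point 1: the key is an integer, so calling len() on it raises a TypeError.
try:
    len(by_len[0][0])
    key_has_len = True
except TypeError as err:
    key_has_len = False
    print('TypeError:', err)
```

The user with the most rows comes first, regardless of how large or small its userId is.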

You could run something like this:

import pandas as pd

example = pd.DataFrame({
    'A': [1, 1, 1, 2, 2, 3],
    'B': range(1, 7),
    'C': ['a', 'b', 'c', 'a', 'b', 'a']
})

grouped_by_A = example.groupby('A')
print(f'type of grouped_by_A: {type(grouped_by_A)}')

for x in sorted(grouped_by_A, key=lambda x: len(x[1]), reverse=True):
    print(f'type of x: {type(x)}')
    print('x[0]:', x[0])
    print('type of x[0]:', type(x[0]))
    print('x[1]:\n', x[1])
    print('type of x[1]:', type(x[1]))
    print('len(x[1]):', len(x[1]), '\n')

to confirm the structure I've tried to describe.
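If you only want to check which group has how many rows, without building the list of tuples, one possible alternative is to let pandas count the rows per group. This is not what the tutorial does; it is just a sketch on the same toy dataframe, using groupby(...).size() followed by sort_values, to cross-check the order that sorted produces.

```python
import pandas as pd

example = pd.DataFrame({
    'A': [1, 1, 1, 2, 2, 3],
    'B': range(1, 7),
    'C': ['a', 'b', 'c', 'a', 'b', 'a'],
})

# Number of rows per value of A, largest group first.
group_sizes = example.groupby('A').size().sort_values(ascending=False)
print(group_sizes)

# Keys in the order given by the size-based sort ...
keys_by_size = [int(k) for k in group_sizes.index]

# ... and keys in the order produced by the tutorial-style sorted() call.
keys_by_sorted = [
    int(x[0])
    for x in sorted(example.groupby('A'), key=lambda x: len(x[1]), reverse=True)
]
print(keys_by_size, keys_by_sorted)
```

In this toy example all group sizes differ, so both approaches give the same key order. With ties, the relative order of equally sized groups is not guaranteed to match between the two.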

