I am following a tutorial on analysing a movie dataset. A snippet of the features is as follows.
The tutorial did the following: "Let's also sort these groups so the users that share the most movies in common with the input have higher priority. This provides a richer recommendation since we won't go through every single user."
The tutorial sorted the userId group object using key=lambda x: len(x[1]). From the output, it seems to be sorting by the userId value, but I might be wrong. I would appreciate an explanation.
Can you please explain:
1) What is x[1]? I tried x[0], and it returned the error message "int has no len()".
2) How does sorting by len(x[1]) sort the userId, or whatever it is sorting?
3) Why, after sorting, has userSubsetGroup become a list of nested tuples of (userId, groupby('userId').get_group())?
Code in Tutorial
userSubsetGroup = userSubset.groupby(['userId'])
userSubsetGroup = sorted(userSubsetGroup, key=lambda x: len(x[1]), reverse=True)
Output of sorting
I'll try:
1) pd.DataFrame.groupby returns a DataFrameGroupBy object:
Returns DataFrameGroupBy
Returns a groupby object that contains information about the groups.
Iterating over this object yields tuples, one for each group produced by the .groupby.
The first item, x[0], of such a tuple is the unique value from the grouping column that defines the group. You are grouping by the column userId, so the first element of the tuple is a userId. Each userId appears only once because the groupby pulls all of its rows together. Try
userIDs = [x[0] for x in userSubsetGroup]
and print the list. It will contain all userIds, each only once. Since your userIds are integers, applying len to them doesn't make sense, hence the error.
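A minimal sketch of that point, using a small made-up frame (the `ratings` data is hypothetical, not the tutorial's `userSubset`). It groups by a single column name passed as a string so that each key is a plain scalar; as far as I know, recent pandas versions return one-element tuples as keys when you group by a list such as `['userId']`, in which case `len(x[0])` would not fail.

```python
import pandas as pd

# Hypothetical toy data standing in for the tutorial's userSubset
ratings = pd.DataFrame({
    'userId': [4, 4, 4, 12, 12, 7],
    'movieId': [1, 2, 3, 1, 3, 2],
})

groups = ratings.groupby('userId')

# x[0] is the group key: each userId appears exactly once
user_ids = [x[0] for x in groups]
print(user_ids)

# len() is not defined for integers, which explains the error
try:
    len(user_ids[0])
    len_failed = False
except TypeError as err:
    len_failed = True
    print('TypeError:', err)
```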
The second item, x[1], of such a tuple is a DataFrame: the slice of the original DataFrame that belongs to the first element of the tuple. If you look at the last picture you will see that each DataFrame in a tuple has identical userIds; that is the grouping at work.
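To check this without the picture, here is a small sketch on made-up data (the `ratings` frame is an assumption for illustration). It collects the distinct userId values inside each group's DataFrame and compares that DataFrame with what `get_group` returns for the same key.

```python
import pandas as pd

# Hypothetical toy data, not the tutorial's dataset
ratings = pd.DataFrame({
    'userId': [4, 4, 4, 12, 12, 7],
    'movieId': [1, 2, 3, 1, 3, 2],
})
groups = ratings.groupby('userId')

ids_per_group = {}
for key, frame in groups:
    # frame is x[1]: only the rows whose userId equals the key
    ids_per_group[int(key)] = frame['userId'].unique().tolist()
    # Compare the slice with the one returned by get_group for this key
    same = frame.equals(groups.get_group(key))
    print(key, ids_per_group[int(key)], same)
```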
2) The object that gets sorted is the DataFrameGroupBy object. The sorting criterion here is the length of the DataFrames (number of rows), i.e. the second element of each tuple in the object. So the sort is by group size, not by userId. You could replace the key with key=lambda x: x[1].shape[0] and get the same result, only using the DataFrame's shape property instead of the more generic len function. (Due to reverse=True the sorting is reversed, i.e. the tuples with the longest DataFrames come first.)
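A short sketch comparing the two keys on made-up data (the `ratings` frame and its values are hypothetical). The resulting order follows the number of rows per user, so the userIds themselves end up out of numeric order.

```python
import pandas as pd

# Hypothetical toy data: user 4 has 3 rows, user 12 has 2, user 7 has 1
ratings = pd.DataFrame({
    'userId': [7, 4, 4, 12, 12, 4],
    'movieId': [2, 1, 2, 1, 3, 3],
})
groups = ratings.groupby('userId')

# Same ordering with either key: both count the rows of x[1]
by_len = sorted(groups, key=lambda x: len(x[1]), reverse=True)
by_shape = sorted(groups, key=lambda x: x[1].shape[0], reverse=True)

order_len = [(int(uid), len(df)) for uid, df in by_len]
order_shape = [(int(uid), df.shape[0]) for uid, df in by_shape]
print(order_len)  # [(4, 3), (12, 2), (7, 1)]
```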
3) According to the sorted documentation:
Return a new sorted list from the items in iterable.
The items in userSubsetGroup are the tuples described in 1., so the result is a list of (userId, DataFrame) tuples.
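The change of type can be seen directly in a small sketch (again on a hypothetical `ratings` frame, not the tutorial's data): before sorting the object is a DataFrameGroupBy, afterwards it is an ordinary Python list whose elements can be indexed and unpacked.

```python
import pandas as pd

# Hypothetical toy data for illustration
ratings = pd.DataFrame({
    'userId': [7, 4, 4, 12, 12, 4],
    'movieId': [2, 1, 2, 1, 3, 3],
})
groups = ratings.groupby('userId')
result = sorted(groups, key=lambda x: len(x[1]), reverse=True)

print(type(groups).__name__)  # the groupby object
print(type(result).__name__)  # a plain list after sorted()

# Each list element is a (key, DataFrame) tuple
first_key, first_frame = result[0]
print(first_key, len(first_frame))
```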
You could run something like this
import pandas as pd

example = pd.DataFrame({
    'A': [1, 1, 1, 2, 2, 3],
    'B': range(1, 7),
    'C': ['a', 'b', 'c', 'a', 'b', 'a']
})
grouped_by_A = example.groupby('A')
print(f'type of grouped_by_A: {type(grouped_by_A)}')
# Iterate over the (key, DataFrame) tuples, largest group first
for x in sorted(grouped_by_A, key=lambda x: len(x[1]), reverse=True):
    print(f'type of x: {type(x)}')
    print('x[0]:', x[0])
    print('type of x[0]:', type(x[0]))
    print('x[1]:\n', x[1])
    print('type of x[1]:', type(x[1]))
    print('len(x[1]):', len(x[1]), '\n')
to confirm the structure I've tried to describe.