[英]How to filter and a find a subset of a dataframe in which categorical data in two columns occur more than n, m times
I have a dataframe from a csv which contains userId, ISBN and ratings for a bunch of books. 我有一个来自csv的数据框,其中包含userId,ISBN和一堆书的评级。 I want to find a subset of this dataframe in which both userIds occur more than 200 times and ISBNs occur more than 100 times.
我想找到这个数据帧的一个子集,其中userIds发生超过200次,ISBN发生超过100次。
Following is what I tried: 以下是我试过的:
ratings = pd.read_csv('../data/BX-Book-Ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userId', 'ISBN', 'bookRating']
# Choose users with more than 200 ratings and books with more than 100 ratings
user_rating_count = ratings['userId'].value_counts()
relevant_ratings = ratings[ratings['userId'].isin(user_rating_count[user_rating_count >= 200].index)]
print(relevant_ratings.head())
print(relevant_ratings.shape)
books_rating_count = relevant_ratings['ISBN'].value_counts()
relevant_ratings_book = relevant_ratings[relevant_ratings['ISBN'].isin(
books_rating_count[books_rating_count >= 100].index)]
print(relevant_ratings_book.head())
print(relevant_ratings_book.shape)
# Check that userId occurs more than 200 times
users_grouped = pd.DataFrame(relevant_ratings.groupby('userId')['bookRating'].count()).reset_index()
users_grouped.columns = ['userId', 'ratingCount']
sorted_users = users_grouped.sort_values('ratingCount')
print(sorted_users.head())
# Check that ISBN occurs more than 100 times
books_grouped = pd.DataFrame(relevant_ratings.groupby('ISBN')['bookRating'].count()).reset_index()
books_grouped.columns = ['ISBN', 'ratingCount']
sorted_books = books_grouped.sort_values('ratingCount')
print(sorted_books.head())
Following is the output I got: 以下是我得到的输出:
userId ISBN bookRating
1456 277427 002542730X 10
1457 277427 0026217457 0
1458 277427 003008685X 8
1459 277427 0030615321 0
1460 277427 0060002050 0
(527556, 3)
userId ISBN bookRating
1469 277427 0060930535 0
1471 277427 0060934417 0
1474 277427 0061009059 9
1495 277427 0142001740 0
1513 277427 0312966091 0
(13793, 3)
userId ratingCount
73 26883 200
298 99955 200
826 252827 200
107 36554 200
240 83671 200
ISBN ratingCount
0 0330299891 1
132873 074939918X 1
132874 0749399201 1
132875 074939921X 1
132877 0749399295 1
As seen above when sorting the table in ascending order grouped by userId, it shows userIds only more than 200 times. 如上所示,当按userId按升序对表进行排序时,它只显示userIds超过200次。 But when sorting the table in ascending order grouped by ISBN, it shows ISBNs which occurs even 1 time.
但是,当按照按ISBN分组的升序对表格进行排序时,它会显示甚至出现一次的ISBN。
I expected both userIds and ISBNs to occur more than 200 and 100 times respectively. 我希望userIds和ISBN分别出现200次和100次以上。 Please let me know what I have done wrong and how to get the correct result.
请让我知道我做错了什么以及如何获得正确的结果。
You should try and produce a small version of the problem that can be solved without access to large csv files. 您应该尝试生成可以在不访问大型csv文件的情况下解决的小问题。 Check this page for more details: https://stackoverflow.com/help/how-to-ask
查看此页面了解更多详细信息: https : //stackoverflow.com/help/how-to-ask
That said, here is a dummy version of your dataset: 也就是说,这是您的数据集的虚拟版本:
import pandas as pd
import random
import string
n=1000
isbn = [random.choice(['abc','def','ghi','jkl','mno']) for x in range(n)]
rating = [random.choice(range(9)) for x in range(n)]
userId = [random.choice(['x','y','z']) for x in range(n)]
df = pd.DataFrame({'isbn':isbn,'rating':rating,'userId':userId})
You can get the counts by userId and isbns this way: 您可以通过userId和isbns获取计数:
df_userId_count = df.groupby('userId',as_index=False)['rating'].count()
df_isbn_count = df.groupby('isbn',as_index=False)['rating'].count()
and extract the unique values by: 并通过以下方式提取唯一值:
userId_select = (df_userId_count[df_userId_count.rating>200].userId.values)
isbn_select = (df_isbn_count[df_isbn_count.rating>100].isbn.values)
So that your final filtered dataframe is: 这样您的最终过滤数据框就是:
df = df[df.userId.isin(userId_select) & df.isbn.isin(isbn_select) ]
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.