
How to filter a dataframe to the subset in which categorical values in two columns occur more than n and m times

I have a dataframe loaded from a CSV file that contains userId, ISBN and ratings for a bunch of books. I want to find the subset of this dataframe in which every userId occurs more than 200 times and every ISBN occurs more than 100 times.

The following is what I tried:

import pandas as pd

ratings = pd.read_csv('../data/BX-Book-Ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userId', 'ISBN', 'bookRating']

# Choose users with more than 200 ratings and books with more than 100 ratings

user_rating_count = ratings['userId'].value_counts()
relevant_ratings = ratings[ratings['userId'].isin(user_rating_count[user_rating_count >= 200].index)]
print(relevant_ratings.head())
print(relevant_ratings.shape)
books_rating_count = relevant_ratings['ISBN'].value_counts()
relevant_ratings_book = relevant_ratings[relevant_ratings['ISBN'].isin(
    books_rating_count[books_rating_count >= 100].index)]
print(relevant_ratings_book.head())
print(relevant_ratings_book.shape)

# Check that userId occurs more than 200 times

users_grouped = pd.DataFrame(relevant_ratings.groupby('userId')['bookRating'].count()).reset_index()
users_grouped.columns = ['userId', 'ratingCount']
sorted_users = users_grouped.sort_values('ratingCount')
print(sorted_users.head())

# Check that ISBN occurs more than 100 times

books_grouped = pd.DataFrame(relevant_ratings.groupby('ISBN')['bookRating'].count()).reset_index()
books_grouped.columns = ['ISBN', 'ratingCount']
sorted_books = books_grouped.sort_values('ratingCount')
print(sorted_books.head())

The following is the output I got:

      userId        ISBN  bookRating
1456  277427  002542730X          10
1457  277427  0026217457           0
1458  277427  003008685X           8
1459  277427  0030615321           0
1460  277427  0060002050           0
(527556, 3)
      userId        ISBN  bookRating
1469  277427  0060930535           0
1471  277427  0060934417           0
1474  277427  0061009059           9
1495  277427  0142001740           0
1513  277427  0312966091           0
(13793, 3)
     userId  ratingCount
73    26883          200
298   99955          200
826  252827          200
107   36554          200
240   83671          200
               ISBN  ratingCount
0        0330299891            1
132873   074939918X            1
132874   0749399201            1
132875   074939921X            1
132877   0749399295            1

As seen above, when the table grouped by userId is sorted in ascending order, it shows only userIds that occur 200 or more times. But when the table grouped by ISBN is sorted in ascending order, it shows ISBNs that occur only once.

I expected userIds and ISBNs to occur more than 200 and 100 times respectively. Please let me know what I have done wrong and how to get the correct result.

You should try to produce a small version of the problem that can be solved without access to large CSV files. See this page for more details: https://stackoverflow.com/help/how-to-ask

That said, here is a dummy version of your dataset:

import random

import pandas as pd

n = 1000  # number of dummy rating rows
# Draw each column at random from a small set of categories
isbn = [random.choice(['abc', 'def', 'ghi', 'jkl', 'mno']) for _ in range(n)]
rating = [random.choice(range(9)) for _ in range(n)]
userId = [random.choice(['x', 'y', 'z']) for _ in range(n)]
df = pd.DataFrame({'isbn': isbn, 'rating': rating, 'userId': userId})
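Because the dummy data is drawn at random, the counts differ from run to run. A minimal sketch of making it repeatable and looking at the counts before picking thresholds follows; the seed value 0 is an arbitrary assumption, and `user_counts` / `isbn_counts` are names introduced here for illustration.

```python
import random

import pandas as pd

random.seed(0)  # arbitrary seed (an assumption); any fixed value makes runs repeatable
n = 1000
df = pd.DataFrame({
    'isbn': [random.choice(['abc', 'def', 'ghi', 'jkl', 'mno']) for _ in range(n)],
    'rating': [random.choice(range(9)) for _ in range(n)],
    'userId': [random.choice(['x', 'y', 'z']) for _ in range(n)],
})

# Number of rows per distinct value, sorted from most to least frequent
user_counts = df['userId'].value_counts()
isbn_counts = df['isbn'].value_counts()
print(user_counts)
print(isbn_counts)
```

With 1000 rows spread over 3 users and 5 ISBNs, most values are likely to clear the 200 and 100 thresholds, so smaller `n` or larger thresholds may be needed to see the filter remove anything.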

You can get the counts by userId and isbn this way:

df_userId_count = df.groupby('userId',as_index=False)['rating'].count()
df_isbn_count = df.groupby('isbn',as_index=False)['rating'].count()

and extract the values that pass each threshold with:

userId_select = (df_userId_count[df_userId_count.rating>200].userId.values)
isbn_select = (df_isbn_count[df_isbn_count.rating>100].isbn.values)

So that your final filtered dataframe is:

df = df[df.userId.isin(userId_select) & df.isbn.isin(isbn_select)]
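One caveat, offered as a sketch and not as part of the original answer: filtering on counts taken from the unfiltered frame does not guarantee that the counts still hold in the result, because dropping rows for one column lowers the counts in the other. If the thresholds must hold in the final subset, one option is to repeat the filter until nothing more is removed. The helper name `filter_min_counts` and the toy data below are hypothetical, with small thresholds chosen to keep the example readable.

```python
import numpy as np
import pandas as pd


def filter_min_counts(frame, thresholds):
    """Drop rows until every value in each listed column occurs more
    than its threshold within the remaining rows."""
    out = frame
    while True:
        keep = np.ones(len(out), dtype=bool)
        for col, min_count in thresholds.items():
            # Give each row the number of rows that share its value in `col`
            counts = out[col].map(out[col].value_counts())
            keep &= (counts > min_count).to_numpy()
        if keep.all():  # nothing left to remove: counts are now consistent
            return out
        out = out[keep]


# Hypothetical toy data: userId must occur more than 2 times, isbn more than 1 time
toy = pd.DataFrame({
    'userId': ['u1', 'u1', 'u1', 'u2', 'u2', 'u3', 'u3', 'u3'],
    'isbn':   ['p',  'p',  'q',  'q',  'r',  'p',  'p',  'p'],
})
limits = {'userId': 2, 'isbn': 1}

# Single pass using counts from the unfiltered frame
user_ok = toy['userId'].map(toy['userId'].value_counts()) > limits['userId']
isbn_ok = toy['isbn'].map(toy['isbn'].value_counts()) > limits['isbn']
single_pass = toy[user_ok & isbn_ok]

# Repeated passes until the counts hold in the result itself
result = filter_min_counts(toy, limits)

print(single_pass['isbn'].value_counts())
print(result)
```

In this toy frame the single pass keeps a row with ISBN `q` even though `q` occurs only once among the surviving rows, while the repeated filter removes it. Separately, when verifying the result, group the final filtered frame: the ISBN check in the question groups `relevant_ratings`, which is the frame from before the book filter was applied, and not `relevant_ratings_book`.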
