
Pandas filter smallest by group

I have a data frame that has the following format:

d = {'id1': ['a', 'a', 'b', 'b'], 'id2': ['a', 'b', 'b', 'c'], 'score': [1, 2, 3, 4]}
df = pd.DataFrame(data=d)


print(df)
     id1    id2    score
0     a      a       1
1     a      b       2
2     b      b       3
3     b      c       4

The data frame has over 1 billion rows; it represents pairwise distance scores between the objects in columns id1 and id2. I do not need all object pair combinations: for each object in id1 (there are about 40k unique ids) I only want to keep the 100 closest (smallest) distance scores.

The code I'm running to do this is the following:

df = df.groupby(['id1'])['score'].nsmallest(100)

The issue with this code is that I run into a memory error each time I try to run it:

MemoryError: Unable to allocate 8.53 GiB for an array with shape (1144468900,) and data type float64

I'm assuming it is because, in the background, pandas is creating a new data frame for the result of the group by while the existing data frame is still held in memory.

The reason I am only taking the top 100 of each id is to reduce the size of the data frame, but it seems that while doing that process I am actually taking up more space.

Is there a way I can go about filtering this data down but not taking up more memory?

The desired output would be something like this (assuming top 1 instead of top 100):

     id1    id2    score
0     a      a       1
2     b      b       3

Some additional info about the original df:

df.count()
permid_1    1144468900
permid_2    1144468900
distance    1144468900
dtype: int64

df.dtypes
permid_1      int64
permid_2      int64
distance    float64
dtype: object

df.shape
(1144468900, 3)

id1 & id2 unique value counts: 33,830

I can't test this code, lacking your data, but perhaps try something like this:

indices = []
for the_id in df['id1'].unique():
    scores = df['score'][df['id1'] == the_id]
    min_subindices = np.argsort(scores.values)[:100]  # positions within this group (NumPy works on raw positions)
    min_indices = scores.iloc[min_subindices].index   # map positions back to pandas index labels
    indices.extend(min_indices)

df = df.loc[indices]

Descriptively: for each unique ID ( the_id ), extract the matching scores. Then find the raw positions of the smallest 100, select them, and map from the raw positions back to the pandas index. Save the pandas index labels to your list. At the end, subset the frame on that collected index.

iloc does take a list input. some_series.iloc should align properly with some_series.values , which is what allows this to work. Storing indices indirectly like this should make this substantially more memory-efficient.

df['score'][df['id1'] == the_id] should work more efficiently than df.loc[df['id1'] == the_id, 'score'] : instead of taking the whole data frame and masking it, it takes only the score column and masks it for matching IDs. You may want to del scores at the end of each loop if you want to immediately free more memory.
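To check the mechanics on a small frame (the toy data and TOP_N below are stand-ins for the billion-row original and its top 100), something like:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data.
df = pd.DataFrame({
    'id1':   ['a', 'a', 'a', 'b', 'b', 'b'],
    'id2':   ['x', 'y', 'z', 'x', 'y', 'z'],
    'score': [3.0, 1.0, 2.0, 6.0, 5.0, 4.0],
})
TOP_N = 2  # stands in for 100

indices = []
for the_id in df['id1'].unique():
    scores = df['score'][df['id1'] == the_id]
    min_subindices = np.argsort(scores.values)[:TOP_N]  # raw positions of the smallest scores
    min_indices = scores.iloc[min_subindices].index     # back to pandas index labels
    indices.extend(min_indices)

result = df.loc[indices]
print(result)  # the TOP_N smallest-score rows per id1
```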

You can try the following:

df.sort_values(["id1", "score"], inplace=True)
df["dummy_key"] = df["id1"].shift(100).ne(df["id1"])

df = df.loc[df["dummy_key"]]
  1. You sort ascending (smallest on top): first by group, then by score.

  2. You add a column indicating whether the current id1 differs from the one 100 rows back (if it doesn't, your row is 101st or later within its group).

  3. You filter by the column from step 2.
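On toy data (with a hypothetical TOP_N = 2 in place of 100), the three steps look like:

```python
import pandas as pd

# Toy frame; TOP_N stands in for 100.
df = pd.DataFrame({
    'id1':   ['a', 'a', 'a', 'b', 'b'],
    'score': [3.0, 1.0, 2.0, 5.0, 4.0],
})
TOP_N = 2

df = df.sort_values(['id1', 'score'])           # step 1: group, then score ascending
keep = df['id1'].shift(TOP_N).ne(df['id1'])     # step 2: True for the first TOP_N rows of each id1
result = df.loc[keep]                           # step 3: filter
print(result)
```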

As Aryerez outlined in a comment, you can do something along the lines of:

closest = pd.concat([df.loc[df['id1'] == id1].sort_values(by='score').head(100)
                     for id1 in set(df['id1'])])

You could also do:

def get_hundredth(id1):
    sub_df = df.loc[df['id1'] == id1].sort_values(by='score')
    return sub_df.iloc[99]['score']  # the 100th-smallest score (position 99)

hundredth_dict = {id1: get_hundredth(id1) for id1 in set(df['id1'])}

def check_distance(row):
    return row['score'] <= hundredth_dict[row['id1']]

closest = df.loc[df.apply(check_distance, axis=1)]

Another strategy would be to look at how filtering out distances past a threshold affects the dataframe. That is, take

low_scores = df.loc[df['score'] < threshold]

Does this significantly decrease the size of the dataframe for some reasonable threshold? You'd need a threshold that makes the dataframe small enough to work with, but leaves the lowest 100 scores for each id1 .
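One way to sanity-check a candidate threshold (toy data; threshold and TOP_N here are placeholders for your real values) is to confirm that every id1 still has at least 100 rows after the cut:

```python
import pandas as pd

# Toy frame standing in for the real data.
df = pd.DataFrame({
    'id1':   ['a', 'a', 'a', 'b', 'b'],
    'score': [0.1, 0.5, 2.0, 0.2, 3.0],
})
threshold = 1.0
TOP_N = 2  # stands in for 100

low_scores = df.loc[df['score'] < threshold]

# Count surviving rows per id1 (fill 0 for ids cut out entirely).
sizes = low_scores.groupby('id1').size().reindex(df['id1'].unique(), fill_value=0)
print(sizes.min() >= TOP_N)  # False here: 'b' kept only 1 row, so this threshold is too aggressive
```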

You also might want to look into what sort of optimization you can do given your distance metric. There are probably algorithms out there specifically for cosine similarity.
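For instance, if the scores really are cosine distances over object vectors, you may be able to skip the billion-row pairwise frame entirely and take the top-k per object directly with NumPy. A sketch under that assumption (the matrix X of object vectors is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))  # hypothetical object vectors, one row per object
TOP_N = 5                     # stands in for 100

# Cosine distance = 1 - cosine similarity of L2-normalised rows.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
dist = 1.0 - Xn @ Xn.T

# For each row, the TOP_N smallest distances via partial selection (no full sort).
nearest = np.argpartition(dist, TOP_N, axis=1)[:, :TOP_N]
print(nearest.shape)  # (50, 5)
```

In practice you would compute `dist` in row blocks rather than all at once, so only a block of the distance matrix is ever in memory.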

For the given shape (1144468900, 3) with 33,830 unique values, the id1 and id2 columns are good candidates for the categorical dtype: each unique value is stored once and every row holds only a compact integer code, which should cut the memory for these two columns substantially (roughly in half here, since int64 values become int32 codes) and speed up the groupby. Then perform whatever aggregation you want.

df[['id1', 'id2']] = df[['id1', 'id2']].astype('category')
out = df.groupby(['id1'])['score'].nsmallest(100)
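A quick way to see the saving on a toy column (the values and row count here are made up, not the real frame):

```python
import pandas as pd

# Many repeats of few unique values: the categorical sweet spot.
df = pd.DataFrame({'id1': [1001, 1002, 1001, 1002] * 250_000})

before = df['id1'].memory_usage(index=False, deep=True)
df['id1'] = df['id1'].astype('category')
after = df['id1'].memory_usage(index=False, deep=True)
print(before, after)  # the per-row codes are far smaller than int64 values
```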
