
Pandas filter smallest by group

I have a data frame that has the following format:

d = {'id1': ['a', 'a', 'b', 'b'], 'id2': ['a', 'b', 'b', 'c'], 'score': ['1', '2', '3', '4']}
df = pd.DataFrame(data=d)


print(df)
  id1 id2 score
0   a   a     1
1   a   b     2
2   b   b     3
3   b   c     4

The data frame has over 1 billion rows and represents pairwise distance scores between the objects in columns id1 and id2. I do not need all pair combinations: for each object in id1 (there are about 40k unique ids), I only want to keep the 100 closest (smallest) distance scores.

The code I'm running to do this is the following:

df = df.groupby(['id1'])['score'].nsmallest(100)

The issue with this code is that I run into a memory error each time I try to run it:

MemoryError: Unable to allocate 8.53 GiB for an array with shape (1144468900,) and data type float64

I'm assuming this is because, in the background, pandas creates a new data frame for the result of the groupby while the existing data frame is still held in memory.

The reason I am only taking the top 100 for each id is to reduce the size of the data frame, but it seems that in the process of doing so I actually use up more memory.

Is there a way I can go about filtering this data down but not taking up more memory?

The desired output would be something like this (assuming top 1 instead of top 100)

  id1 id2 score
0   a   a     1
2   b   b     3

Some additional info about the original df:

df.count()
permid_1    1144468900
permid_2    1144468900
distance    1144468900
dtype: int64

df.dtypes
permid_1      int64
permid_2      int64
distance    float64
dtype: object

df.shape
(1144468900, 3)

id1 & id2 unique value counts: 33,830

I can't test this code, lacking your data, but perhaps try something like this:

import numpy as np

indices = []
for the_id in df['id1'].unique():
    scores = df['score'][df['id1'] == the_id]           # scores for this id only
    min_subindices = np.argsort(scores.values)[:100]    # raw positions of the 100 smallest
    min_indices = scores.iloc[min_subindices].index     # map raw positions back to pandas index labels
    indices.extend(min_indices)

df = df.loc[indices]

Descriptively: for each unique ID (the_id), extract the matching scores, find the raw (positional) indices of the 100 smallest, map those positions back to the pandas index, and save the resulting index labels to the list. At the end, subset the frame on that accumulated index.

iloc does take a list input, and some_series.iloc aligns positionally with some_series.values, which is what allows this to work. Storing only the indices like this should be substantially more memory-efficient than building intermediate frames.

df['score'][df['id1'] == the_id] should be more efficient than df.loc[df['id1'] == the_id, 'score']: instead of taking the whole data frame and masking it, it takes only the score column and masks it for matching IDs. You may want to del scores at the end of each loop iteration if you want to free memory immediately.
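As a quick illustration of the index mapping, here is a minimal sketch of the same idea on the toy frame from the question, keeping the top 1 per id1 instead of the top 100; the score column is cast to float first (the real data is already float64):

import numpy as np
import pandas as pd

d = {'id1': ['a', 'a', 'b', 'b'], 'id2': ['a', 'b', 'b', 'c'], 'score': ['1', '2', '3', '4']}
df = pd.DataFrame(data=d)
df['score'] = df['score'].astype(float)   # toy scores are strings; the real column is float64

indices = []
for the_id in df['id1'].unique():
    scores = df['score'][df['id1'] == the_id]
    min_subindices = np.argsort(scores.values)[:1]    # top 1 here; use [:100] on the real data
    min_indices = scores.iloc[min_subindices].index   # raw positions -> pandas labels
    indices.extend(min_indices)
    del scores                                        # free the per-id slice right away

print(df.loc[indices])
#   id1 id2  score
# 0   a   a    1.0
# 2   b   b    3.0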

You can try the following:

df.sort_values(["id1", "score"], inplace=True)
df["dummy_key"] = df["id1"].shift(100).ne(df["id1"])

df = df.loc[df["dummy_key"]]
  1. You sort ascending (smallest on top), first by id1, then by score.

  2. You add a column indicating whether the current id1 differs from the one 100 rows back (if it does not, the row is at position 101+ within its group).

  3. You filter by the column from step 2; see the sketch below.
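A minimal sketch of the same trick on the toy frame, keeping only the single smallest score per id1 (shift(1) instead of shift(100)); scores are written as numbers here for clarity:

import pandas as pd

d = {'id1': ['a', 'a', 'b', 'b'], 'id2': ['a', 'b', 'b', 'c'], 'score': [1, 2, 3, 4]}
df = pd.DataFrame(data=d)

df.sort_values(["id1", "score"], inplace=True)
# True whenever the row is within the first N rows of its id1 block (here N=1)
df["dummy_key"] = df["id1"].shift(1).ne(df["id1"])

print(df.loc[df["dummy_key"], ["id1", "id2", "score"]])
#   id1 id2  score
# 0   a   a      1
# 2   b   b      3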

As Aryerez outlined in a comment, you can do something along the lines of:

closest = pd.concat([df.loc[df['id1'] == id1].sort_values(by='score').head(100)
                     for id1 in set(df['id1'])])

You could also do

def get_hundredth(id1):
    sub_df = df.loc[df['id1'] == id1].sort_values(by='score')
    return sub_df.iloc[99]['score']   # the 100th-smallest score for this id1

hundredth_dict = {id1: get_hundredth(id1) for id1 in set(df['id1'])}

def check_distance(row):
    return row['score'] <= hundredth_dict[row['id1']]

closest = df.loc[df.apply(check_distance, axis=1)]
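Row-wise apply over a billion rows will be slow; assuming the same hundredth_dict built above, a vectorized sketch of that final filter could look like this:

# Map each row's id1 to its 100th-smallest score, then compare column-wise.
threshold_per_row = df['id1'].map(hundredth_dict)
closest = df.loc[df['score'] <= threshold_per_row]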

Another strategy would be to look at how filtering out distances past a threshold affects the dataframe. That is, take

low_scores = df.loc[df['score'] < threshold]

Does this significantly decrease the size of the dataframe for some reasonable threshold? You'd need a threshold that makes the dataframe small enough to work with, but leaves the lowest 100 scores for each id1.
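A minimal sketch of that check; the threshold value here is purely hypothetical and would have to be tuned for your data:

threshold = 0.5   # hypothetical value; tune for your data
low_scores = df.loc[df['score'] < threshold]

print(f"rows kept: {len(low_scores)} of {len(df)}")
# Verify that every id1 still has at least 100 rows left after filtering.
per_id_counts = low_scores.groupby('id1').size()
print((per_id_counts >= 100).all())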

You also might want to look into what sort of optimization you can do given your distance metric; there are probably algorithms out there designed specifically for cosine similarity, as sketched below.
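For example, if you still have the original object vectors that the pairwise scores were computed from (an assumption; the question only shows the precomputed pairs), a nearest-neighbour search can return the 100 closest objects per object directly, without ever materializing all ~1.1 billion pairs. A minimal sketch with scikit-learn:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Placeholder vectors for illustration; in practice these would be the
# original (n_objects, n_features) representations the distances came from.
vectors = np.random.rand(33830, 64)

nn = NearestNeighbors(n_neighbors=100, metric='cosine')
nn.fit(vectors)
# Each object's nearest neighbour is itself, so request one extra if
# self-matches should be dropped.
distances, neighbor_idx = nn.kneighbors(vectors)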

For the given shape (1144468900, 3) with only 33,830 unique values, the id1 and id2 columns are good candidates for the categorical data type. Converting them stores each value as a compact integer code (int32, since there are 33,830 categories) plus one shared 33,830-entry list of categories, which roughly halves the memory used by these two int64 columns. Then perform any aggregation you want.

df[['id1', 'id2']] = df[['id1', 'id2']].astype('category')
out = df.groupby(['id1'])['score'].nsmallest(100)
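A quick way to sanity-check the saving on your machine after the conversion above; the exact numbers will depend on your data:

# Category codes for 33,830 categories are int32, roughly halving
# these two columns compared to int64.
print(df[['id1', 'id2']].dtypes)
print(df[['id1', 'id2']].memory_usage(deep=True))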
