How to apply pandas groupby with minimum and maximum size conditions (Pythonic way)

I have a pandas DataFrame that I need to group and store in a new array, where every group must have a size within a specific range; if a group falls below the minimum size, it should be added to one of the previous groups that has the smallest size. For example, after grouping the data I will have groups G with len(G) < a, len(G) > b, or a <= len(G) <= b. I need to adjust the groups outside that range so that every group meets the condition a <= len(G) <= b.

The code is working now, so I would like to know if there is a more convenient way to do that.

import numpy as np
import pandas as pd

rng = np.random.default_rng()  # Just for testing
df = pd.DataFrame(rng.integers(0, 10, size=(1000, 4)), columns=list('ABCD'))
# Group the dataframe by a specific column (here the fourth one, 'D').
ans = [pd.DataFrame(y) for x, y in df.groupby(df.columns[3], as_index=False)] 

n = 20  # The maximum size of a group

new_arrayi_index = 0
new_array = []
for count_index in range(len(ans)):
    l = ans[count_index]
   
    if len(l) > n:

        df_shuffled = pd.DataFrame(l).sample(frac=1)
        final = [df_shuffled[i:i+n] for i in range(0,df_shuffled.shape[0],n)]

        for inde in range(len(final)):
            if len(final[inde]) <= 5 and new_arrayi_index != 0: #The minimum size of the group is 5

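                # Append the small chunk's rows to the previous group.
                # (Using + on two DataFrames adds values element-wise instead of concatenating rows.)
                new_array[new_arrayi_index - 1] = pd.concat([new_array[new_arrayi_index - 1], final[inde]])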

            else:
                new_array.append(final[inde])
                new_arrayi_index += 1

    else: 

        new_array.append(l)
        new_arrayi_index += 1

count_index_ = 0
for count_index in range(len(new_array)):
    print("count", count_index, "Size", len(new_array[count_index]))
    print(new_array[count_index])
    count_index_ += len(new_array[count_index])  # accumulate the total number of rows

print(count_index_)

Change this line:

ans = [pd.DataFrame(y) for x, y in df.groupby(df.columns[3], as_index=False)]

to the following for the minimum:

ans = [pd.DataFrame(y) for x, y in df.groupby(df.columns[3].min(), as_index=False)]

and to the following for the maximum:

ans = [pd.DataFrame(y) for x, y in df.groupby(df.columns[3].max(), as_index=False)]
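Note that `df.columns[3]` is the column label (the string `'D'` in the question's dataframe), and a Python string has no `.min()` or `.max()` method, so the modified lines would raise an `AttributeError` as written. If the goal is only to inspect the smallest and largest group sizes, a minimal sketch could look like the following; the name `group_sizes` and the fixed seed are introduced here for illustration and do not come from the original posts.

```python
import numpy as np
import pandas as pd

# Same test data as in the question; the seed is fixed only for repeatability.
rng = np.random.default_rng(seed=1)
df = pd.DataFrame(rng.integers(0, 10, size=(1000, 4)), columns=list('ABCD'))

# Number of rows in each group of the fourth column ('D').
group_sizes = df.groupby(df.columns[3]).size()

print("smallest group:", group_sizes.min())
print("largest group:", group_sizes.max())
```

This only reports the sizes; it does not split or merge any group.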

I wrote a function that splits the dataframe into chunks that are equal to the max size. It checks the size of the remainder for the last chunk, and if the remainder is smaller than the minimum size, it splits the last two chunks into two chunks of approximately equal size.

Building off the answer at "Split a large pandas dataframe":

import numpy as np
import pandas as pd


rng = np.random.default_rng(seed=1)  # Just for testing
df = pd.DataFrame(rng.integers(0, 10, size=(1000, 4)), columns=list('ABCD'))
# The dataframe is grouped by a specific column further down.

n = 20  # The maximum size of a group


# https://stackoverflow.com/questions/17315737/split-a-large-pandas-dataframe

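def split_dataframe(df, chunk_size=20, min_size=10):
    """Split df into chunks of at most chunk_size rows.

    If the last chunk would be smaller than min_size, the last two chunks
    are re-split into two chunks of approximately equal size.
    """
    chunks = []
    remainder = len(df) % chunk_size
    num_full = len(df) // chunk_size  # number of full-size chunks

    if 0 < remainder < min_size and num_full >= 1:
        # Keep every full chunk except the last one as-is.
        for i in range(num_full - 1):
            chunks.append(df[i * chunk_size:(i + 1) * chunk_size])
        # Split the last full chunk plus the remainder into two halves.
        df_ = df[(num_full - 1) * chunk_size:]
        last_break = len(df_) // 2
        chunks.append(df_[:last_break])
        chunks.append(df_[last_break:])
    else:
        # Stepping by chunk_size avoids an empty trailing chunk
        # when len(df) is an exact multiple of chunk_size.
        for start in range(0, len(df), chunk_size):
            chunks.append(df[start:start + chunk_size])
    return chunks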


new_array = []
for group, df_ in df.groupby(df.columns[3], as_index=False):
    new_array.extend(split_dataframe(df_))

count_index_ = 0
for count_index in range(len(new_array)):
    print("count", count_index, "Size", len(new_array[count_index]))
    print(new_array[count_index])
    count_index_ += len(new_array[count_index])  # accumulate the total number of rows

print(count_index_)

I have been following this post since the beginning, curious about how the discussion would go, because the OP's problem is not always possible to solve.

Existence of a solution

Take the following example: a group has 19 elements and you want to split it into sections of size between 10 and 15.

A solution exists if and only if there exists an integer g such that n/b <= g <= n/a, where n is the number of elements in the group. In that case, g sections of length a use g*a <= n elements, and g sections of length b use g*b >= n elements, so some choice of lengths between a and b uses exactly n elements. In the example above, n/b = 19/15 ≈ 1.27 and n/a = 19/10 = 1.9; no integer lies between them, so no solution exists.

In this situation it is also possible to have a balanced partition, in the sense that the largest section will be at most one record larger than the smallest section (the smallest will have n//g records).
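The existence condition and the balanced partition can be sketched in a few lines. The function names `min_sections` and `balanced_sizes` are hypothetical helpers introduced here for illustration, and the sketch assumes 1 <= a <= b and n >= 1.

```python
def min_sections(n, a, b):
    """Return the smallest valid number of sections, or None if no split exists.

    Assumes 1 <= a <= b and n >= 1.
    """
    g = -(-n // b)  # ceil(n / b): the smallest integer g with n/b <= g
    # g <= n/a is the same as g*a <= n; if the smallest candidate fails, all do.
    return g if g * a <= n else None


def balanced_sizes(n, g):
    """Sizes of g sections covering n records, differing by at most one."""
    q, r = divmod(n, g)
    return [q + 1] * r + [q] * (g - r)


print(min_sections(19, 10, 15))  # None: no integer between 19/15 and 19/10
print(min_sections(30, 10, 15))  # 2
print(balanced_sizes(30, 2))     # [15, 15]
```

Checking only the smallest candidate g is enough, because any larger g makes the condition g*a <= n harder to satisfy.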

Restating the problem

We could slightly modify the problem to: split each group into the minimum possible number of sections containing at most b records each, such that the length of each section satisfies a <= len(s) <= a+1.

Notice that in this case we are adjusting a to be as close as possible to b so that the problem has a solution. For solvable problems, the result is also a solution to the original problem; for problems that cannot be solved, it relaxes the original requirement by reducing a until the problem can be solved.

The example above would become: split 19 elements into the minimum possible number of balanced groups with no more than 15 elements each. The solution then has one section of 10 elements and one section of 9 elements.
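As a quick check of this example, the sketch below computes the number of sections and the adjusted a, then lets `np.array_split` produce the balanced sections on a plain NumPy array. The helper name `restated_parameters` is an assumption made for illustration, and the sketch assumes n >= 1 and b >= 1.

```python
import numpy as np


def restated_parameters(n, b):
    """Return (g, a) for the restated problem; assumes n >= 1 and b >= 1."""
    g = -(-n // b)  # minimum number of sections holding at most b records each
    a = n // g      # adjusted lower bound: every section has a or a + 1 records
    return g, a


g, a = restated_parameters(19, 15)
sections = np.array_split(np.arange(19), g)
sizes = [len(s) for s in sections]

print(g, a)   # 2 9
print(sizes)  # [10, 9]
```

Every size falls in the range a <= len(s) <= a+1, which here is 9 to 10.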

A Pythonic solution

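def group_and_split(df, b, column):
    '''
    - df    : a dataframe
    - b     : the largest allowed section size
    - column: the column by which the data must be grouped

    Requires numpy imported as np. Returns one list of sections per group.
    '''

    # ceil(len(y) / b) is the minimum number of sections of at most b rows;
    # np.array_split makes them differ in length by at most one row.
    return [np.array_split(y, (len(y)+b-1)//b)
            for _, y in df.groupby(column, as_index=False)]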

You can check that it gives a solution to the restated problem:

pd.DataFrame([{
    'num-sections': len(g),
    'largest-section': max(len(gi) for gi in g),
    'smallest-section': min(len(gi) for gi in g)
} for g in group_and_split(df, 25, 'D')])
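The `group_and_split` function above relies on `np.array_split` accepting a DataFrame. On some newer pandas versions this reportedly emits a `FutureWarning` about `DataFrame.swapaxes`, so here is a hedged alternative sketch that slices each group with `iloc` instead. The name `group_and_split_iloc` is hypothetical and not part of the original answer; it is intended to return the same nested structure (one list of sections per group).

```python
import numpy as np
import pandas as pd


def group_and_split_iloc(df, b, column):
    """Split each group of df[column] into balanced sections of at most b rows."""
    result = []
    for _, group in df.groupby(column):
        g = -(-len(group) // b)        # minimum number of sections (ceil division)
        q, r = divmod(len(group), g)   # r sections get q + 1 rows, the rest get q
        sizes = [q + 1] * r + [q] * (g - r)
        bounds = np.cumsum([0] + sizes)  # start and end row positions
        result.append([group.iloc[s:e] for s, e in zip(bounds[:-1], bounds[1:])])
    return result


rng = np.random.default_rng(seed=1)  # Just for testing
df = pd.DataFrame(rng.integers(0, 10, size=(1000, 4)), columns=list('ABCD'))

summary = pd.DataFrame([{
    'num-sections': len(g),
    'largest-section': max(len(gi) for gi in g),
    'smallest-section': min(len(gi) for gi in g),
} for g in group_and_split_iloc(df, 25, 'D')])
print(summary)
```

In each row of the summary the largest and smallest section should differ by at most one, and no section should exceed b rows.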
