
How to sum elements from a list and create sublists with them when they sum to a value of n in Python

I am kind of stuck.

So here is my scenario.

I have a list of small files (parquet files). My goal is to track them and merge them into parquet files of a more optimal size.

While I could just read them all and run a repartition, this won't work for my use case, since they share a location with other, already partitioned files (it would be too much to handle in terms of data volume).

So I have a list that looks like this:

[
[('filepath.parquet',1000),
('filepath.parquet',1000)],

[('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000)],

 [('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000)]
]

My goal is to have a max group size parameter that will create sub lists of these lists based on the sum of bytes.

Using my list example with a max_group of 5000, where I have one main list with 3 sub lists, I would get:

1 - main list - no change here
1 - sub list 1 would keep all its elements, since the sum of the bytes is only 2000
2 - sub list 2 would be split into 2 sub sub lists, since the total sum is 8000 and max_group is 5000, e.g.:

[('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000)],

[('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000)]
  • so one sub sub list will be 5000 and the other will be 3000

3 - sub list 3 will be split into 3 sub sub lists as below, again following the same max_group:

[('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000)],

[('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000)],

[('filepath.parquet',1000)]

 

So my final list would be:所以我的最终名单是:

[  # main list
    [  # sub list 1
        [  # sub sub list
            ('filepath.parquet',1000),
            ('filepath.parquet',1000)
        ]
    ],

    [  # sub list 2
        [
            ('filepath.parquet',1000),
            ('filepath.parquet',1000),
            ('filepath.parquet',1000),
            ('filepath.parquet',1000),
            ('filepath.parquet',1000)
        ],

        [
            ('filepath.parquet',1000),
            ('filepath.parquet',1000),
            ('filepath.parquet',1000)
        ]
    ],

    [  # sub list 3
        [
            ('filepath.parquet',1000),
            ('filepath.parquet',1000),
            ('filepath.parquet',1000),
            ('filepath.parquet',1000),
            ('filepath.parquet',1000)
        ],

        [
            ('filepath.parquet',1000),
            ('filepath.parquet',1000),
            ('filepath.parquet',1000),
            ('filepath.parquet',1000),
            ('filepath.parquet',1000)
        ],

        [
            ('filepath.parquet',1000)
        ]
    ]
]

So I am trying to do this in Python. My code:

lst = [
[('filepath.parquet',1000),
('filepath.parquet',1000)],

[('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000)],

 [('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000),
('filepath.parquet',1000)]
]
    
    
max_group  = 5000
i = 0
for k,sublst in enumerate(lst):
    print('entering sublst: ' + str(k))
    for file in sublst:
        f, v = file
        tot = i + v
        print(f)
        while tot <= max_group:
            tot = tot + v

In your code you are trying to iterate over v, which is an integer, to get a sum. The following is the stub of the built-in sum function.

def sum(*args, **kwargs): # real signature unknown
    """
    Return the sum of a 'start' value (default: 0) plus an iterable of numbers
    
    When the iterable is empty, return the start value.
    This function is intended specifically for use with numeric values and may
    reject non-numeric types.
    """
    pass

You need to pass an iterable (for example a list) to sum, not a single integer.

The following code will give you the intended solution.
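As a small illustration of that point, here is a minimal sketch of calling sum on the sizes of one sub list. The sample data is hypothetical and only mirrors the (path, size) tuples from the question.

```python
# Hypothetical sample data in the same (path, size_in_bytes) shape as the question.
sublst = [('filepath.parquet', 1000), ('filepath.parquet', 1000)]

# sum() expects an iterable of numbers, so pull the sizes out of the tuples first.
total = sum(size for _, size in sublst)
print(total)  # 2000
```

Passing a single integer such as v on its own would raise a TypeError, because an int is not iterable.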

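max_group = 5000
final_list = []
for k, sublst in enumerate(lst):
    print('entering sublst: ' + str(k))
    size = 0        # running total of bytes in the current group
    temp_list = []  # files collected for the current group
    for file in sublst:
        f, v = file
        size += v

        # The "and temp_list" check avoids appending an empty group when
        # the very first file is already larger than max_group.
        if size > max_group and temp_list:
            # This file does not fit: close the current group and
            # start a new one that contains only this file.
            final_list.append(temp_list)
            temp_list = [file]
            size = v
        else:
            temp_list.append(file)

    # Flush whatever is left over for this sublist.
    if len(temp_list) > 0:
        final_list.append(temp_list)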

First, v is added to the running size, but the file is not yet added to the list. If the size is greater than max_group, temp_list is appended to final_list and a new temp_list is started containing only the current file. The size is then reset to that file's size as well. If the size is less than or equal to max_group, the file is simply appended to temp_list. At the end of the inner for loop, the length of temp_list is checked, and if there are leftover elements in it, temp_list is appended to final_list as well.
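Note that the loop above collects every group into one flat final_list. If the nesting from the question (main list, sub list, sub sub list) should be kept, the same greedy idea can be wrapped in a function and applied per sub list. This is only a sketch under that assumption; the names group_by_size and group_all are made up for illustration, and the sample data is built to match the sizes in the question.

```python
def group_by_size(files, max_group):
    """Greedily split (path, size) tuples into consecutive groups whose
    total size stays at or below max_group. A single file larger than
    max_group ends up alone in its own group."""
    groups = []
    current = []
    current_size = 0
    for path, size in files:
        # Close the current group if this file would push it over the limit.
        if current and current_size + size > max_group:
            groups.append(current)
            current = []
            current_size = 0
        current.append((path, size))
        current_size += size
    if current:
        groups.append(current)
    return groups


def group_all(lst, max_group):
    """Apply group_by_size to every sub list, keeping the outer nesting."""
    return [group_by_size(sublst, max_group) for sublst in lst]


# Hypothetical data with the same shape as the question: 2, 8 and 11 files.
example = [
    [('filepath.parquet', 1000)] * 2,
    [('filepath.parquet', 1000)] * 8,
    [('filepath.parquet', 1000)] * 11,
]
nested = group_all(example, 5000)
print([[len(group) for group in sub] for sub in nested])  # [[2], [5, 3], [5, 5, 1]]
```

The grouping is order-preserving and greedy, so it does not try to find the tightest possible packing; it only guarantees that no group with more than one file exceeds max_group.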


 