Multiple nested lists with permutations to a pandas DataFrame

Question

I have my first serious question in Python.

I have a few nested lists that I need to convert to a pandas DataFrame. It seems easy, but here is what makes it challenging for me:

- the lists are huge (so the code needs to be fast)
- they are nested
- when they are nested, I need combinations.

So, given this input:

la =  ['a', 'b', 'c', 'd', 'e']
lb = [[1], [2], [3, 33], [11,12,13], [4]]
lc = [[1], [2, 22], [3], [11,12,13], [4]]

I need the following as output:

la      lb      lc
a       1       1
b       2       2
b       2       22
c       3       3
c       33      3
d       11      11
d       11      12
d       11      13
d       12      11
d       12      12
d       12      13
d       13      11
d       13      12
d       13      13
e       4       4

Note that I need all permutations (every lb/lc combination for that row) whenever I have a nested list. At first I simply tried:

import pandas as pd
pd.DataFrame({'la' : [x for x in la],
              'lb' : [x for x in lb],
              'lc' : [x for x in lc]})

But finding the rows that need expanding and then actually expanding a (huge) DataFrame seemed harder than changing the way I create the DataFrame.

I looked at some great posts about itertools (Flattening a shallow list in Python), the documentation (https://docs.python.org/3.6/library/itertools.html) and generators (What does the "yield" keyword do?), and came up with something like this:

import itertools

def f(la, lb, lc):
    tmp = len(la) == len(lb) == len(lc)
    if tmp:
        for item in range(len(la)):
            len_b = len(lb[item])
            len_c = len(lc[item])
            if ((len_b>1) or (len_c>1)):
                yield list(itertools.product(la[item], lb[item], lc[item]))
                ## above: list is not the result I need,
                ##        without it it breaks (not an iterable)
            else:
                yield (la[item], lb[item], lc[item])
    else:
        print('error: unequal length')

which I test with:

my_gen = f(la, lb, lc)
pd.DataFrame.from_records(my_gen)

which... well... breaks when I yield the itertools.product object directly (it has no length), and creates the wrong data structure once I cast it to a list.
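A note on why this misbehaves: from_records treats each yielded item as one row, but list(itertools.product(...)) is a whole list of rows, and the else branch yields the inner lists themselves rather than their values. Below is a minimal sketch of one way around it, assuming Python 3.3+ for yield from. The name iter_rows is made up for illustration and does not come from the original post.

```python
import itertools

import pandas as pd


def iter_rows(la, lb, lc):
    """Yield one (la, lb, lc) tuple per combination (hypothetical helper)."""
    if not (len(la) == len(lb) == len(lc)):
        raise ValueError("unequal length")
    for a, bs, cs in zip(la, lb, lc):
        # Wrap the label in a list so product treats it as a single value.
        # product over one-element lists yields exactly one row, so no
        # special case for non-nested entries is needed.
        yield from itertools.product([a], bs, cs)


la = ['a', 'b', 'c', 'd', 'e']
lb = [[1], [2], [3, 33], [11, 12, 13], [4]]
lc = [[1], [2, 22], [3], [11, 12, 13], [4]]

rows = list(iter_rows(la, lb, lc))
df = pd.DataFrame(rows, columns=['la', 'lb', 'lc'])
print(df)
```

Because the generator yields flat tuples, each one maps to exactly one DataFrame row.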

My questions are as follows:

- How can I fix the issue with yielding itertools?
- Is this efficient? In the real application I will be creating the lists by parsing a file, and they will be huge... Any performance tips or better solutions from more advanced colleagues? Right now it breaks/misbehaves, so I can't even benchmark...
- Would it make sense to generate the lists element by element and then use my f function?

Thank you in advance!

Answer 1

I have a solution:

import pandas as pd
from functools import reduce  # reduce is not a builtin in Python 3
from itertools import product

la =  ['a', 'b', 'c', 'd', 'e']
lb = [[1], [2], [3, 33], [11,12,13], [4]]
lc = [[1], [2, 22], [3], [11,12,13], [4]]

list_product = reduce(lambda x, y: x + y, [list(product(*_)) for _ in zip(la,lb,lc)])
df = pd.DataFrame(list_product, columns=["la", "lb", "lc"])
print(df)

Result:

    la  lb  lc
0   a   1   1
1   b   2   2
2   b   2   22
3   c   3   3
4   c   33  3
5   d   11  11
6   d   11  12
7   d   11  13
8   d   12  11
9   d   12  12
10  d   12  13
11  d   13  11
12  d   13  12
13  d   13  13
14  e   4   4
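On the efficiency question: reduce with x + y builds a new, longer list at every step, so the total copying grows roughly quadratically with the number of rows. A possible variant, sketched here under the assumption that the inputs look like the ones above, chains the per-row products lazily with itertools.chain.from_iterable instead. It also wraps each label in a list, because product(*_) iterates over the label string character by character and would split a multi-character label.

```python
from itertools import chain, product

import pandas as pd

la = ['a', 'b', 'c', 'd', 'e']
lb = [[1], [2], [3, 33], [11, 12, 13], [4]]
lc = [[1], [2, 22], [3], [11, 12, 13], [4]]

# One lazy product per input row, flattened into a single stream of tuples.
# [a] keeps the label intact even if it has more than one character.
rows = list(chain.from_iterable(
    product([a], bs, cs) for a, bs, cs in zip(la, lb, lc)
))

df = pd.DataFrame(rows, columns=['la', 'lb', 'lc'])
print(df)
```

The rows are materialised only once, in the final list call, right before they are handed to pandas.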

Answer 2

It's not an abstract solution, but it does get the results you are looking for. I look forward to seeing a more pandas-centric answer to this problem, but offer this up in the meantime.

import pandas as pd
la =  ['a', 'b', 'c', 'd', 'e']
lb = [[1], [2], [3, 33], [11,12,13], [4]]
lc = [[1], [2, 22], [3], [11,12,13], [4]]

l1 = []
l2 = []
l3 = []

l1Temp = []
l2Temp = []
l3Temp = []

for i, listInt in enumerate(lb):
    if isinstance(listInt, list):  # was type(listInt == list), which is always truthy
        for j, item in enumerate(listInt):
            # print('%s - %s' % (lb[i], lc[i][j]))
            l1Temp.append(la[i])
            l2Temp.append(lb[i][j])
            l3Temp.append(lc[i])
            # print('%s - %s' % (l1[i], l2[i]))
    else:
        l1Temp.append(la[i])
        l2Temp.append(lb[i])
        l3Temp.append(lc[i])
        # print('%s - %s' % (lb[i], lc[i]))

for i, listInt in enumerate(l3Temp):
    if isinstance(listInt, list):  # was type(listInt == list), which is always truthy
        for j, item in enumerate(listInt):
            l1.append(l1Temp[i])
            l2.append(l2Temp[i])
            l3.append(l3Temp[i][j])
    else:
        l1.append(l1Temp[i])
        l2.append(l2Temp[i])
        l3.append(l3Temp[i])

for i, item in enumerate(l3):
    print('%s - %s - %s' % (l1[i], l2[i], l3[i]))

df = pd.DataFrame({'la':[x for x in l1],
    'lb':[x for x in l2],
    'lc': [x for x in l3]})
print(df)
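As for a more pandas-centric route, one option is DataFrame.explode. This is only a sketch and assumes pandas 0.25 or newer, which is later than this thread. Exploding lb and then lc repeats the remaining columns for every element, which produces the same combinations as the loops above. The index is reset between the two calls so the second explode never sees duplicate index labels.

```python
import pandas as pd

la = ['a', 'b', 'c', 'd', 'e']
lb = [[1], [2], [3, 33], [11, 12, 13], [4]]
lc = [[1], [2, 22], [3], [11, 12, 13], [4]]

# Start from one row per label, with the nested lists stored as cell values.
df = pd.DataFrame({'la': la, 'lb': lb, 'lc': lc})

df = (
    df.explode('lb')            # one row per lb element, lc list repeated
      .reset_index(drop=True)
      .explode('lc')            # one row per (lb, lc) pair
      .reset_index(drop=True)
      # explode leaves object dtype; cast back, assuming all values are ints
      .astype({'lb': int, 'lc': int})
)
print(df)
```

Whether this is faster than building the rows with itertools would need to be measured on the real data.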

2 answers

Answer 1: score 2, accepted, 2018-02-25 01:11:11

Answer 2: score 0, 2018-02-24 23:29:12
