
How to convert a list of lists of tuples into a pandas DataFrame as time-efficiently as possible?

I have a list of lists of tuples as input, like this:

[[("apple", "dog", 5), ("banana", "cat", 32.3), ("pineapple", "horse", 33)], [("apple", "dog", 0), ("pear", "dog", 8), ("pear", "cow", 5.5)], [("apple", "dog", 7), ("banana", "dog", 4)]]

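I want to convert it into a pandas DataFrame with one list-valued column per tuple position: one column holding the lists of all first elements, one holding the lists of all second elements, and so on.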

I know I can create a pandas DataFrame with a single column whose values are lists of tuples, like this:

df = pd.DataFrame({"A": [[("apple", "dog", 5), ("banana", "cat", 32.3), ("pineapple", "horse", 33)], [("apple", "dog", 0), ("pear", "dog", 8), ("pear", "cow", 5.5)], [("apple", "dog", 7), ("banana", "dog", 4)]]})

But I don't know how to turn that into a 3-column DataFrame (for this example).

The desired output looks like this:

first                            second                    third
["apple", "banana", "pineapple"] ["dog", "cat", "horse"]   [5, 32.3, 33]
["apple", "pear", "pear"]        ["dog", "dog", "cow"]     [0, 8, 5.5]
["apple", "banana"]              ["dog", "dog"]            [7, 4]  

How can I solve this as time-efficiently as possible? My real DataFrame will contain about 1M rows, so I want the fastest solution possible.

Use zip to format the data before putting it into the DataFrame.

pd.DataFrame([zip(*row) for row in data], columns=["first", "second", "third"]).applymap(list)

                        first             second          third
0  [apple, banana, pineapple]  [dog, cat, horse]  [5, 32.3, 33]
1         [apple, pear, pear]    [dog, dog, cow]    [0, 8, 5.5]
2             [apple, banana]         [dog, dog]         [7, 4]
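One caveat on the snippet above: as far as I know, `DataFrame.applymap` is deprecated since pandas 2.1 in favour of `DataFrame.map`. The following is a minimal sketch of the same zip-based approach that picks whichever method is available; the name `elementwise` is only illustrative and not part of the original answer.

```python
import pandas as pd

data = [
    [("apple", "dog", 5), ("banana", "cat", 32.3), ("pineapple", "horse", 33)],
    [("apple", "dog", 0), ("pear", "dog", 8), ("pear", "cow", 5.5)],
    [("apple", "dog", 7), ("banana", "dog", 4)],
]

# zip(*row) regroups each row's tuples by position, giving three tuples per row.
df = pd.DataFrame([zip(*row) for row in data],
                  columns=["first", "second", "third"])

# Assumption: DataFrame.map is present on pandas >= 2.1; older versions only
# have applymap, so fall back to it when map is missing.
elementwise = df.map if hasattr(df, "map") else df.applymap
df = elementwise(list)  # convert every cell from tuple to list

print(df)
```

On older pandas versions the fallback behaves exactly like the original one-liner.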

A pure Python option:

import pandas as pd

data = [
    [("apple", "dog", 5), ("banana", "cat", 32.3), ("pineapple", "horse", 33)],
    [("apple", "dog", 0), ("pear", "dog", 8), ("pear", "cow", 5.5)],
    [("apple", "dog", 7), ("banana", "dog", 4)]
]

df = pd.DataFrame([map(list, zip(*row)) for row in data],
                  columns=["first", "second", "third"])

print(df)

df

                        first             second          third
0  [apple, banana, pineapple]  [dog, cat, horse]  [5, 32.3, 33]
1         [apple, pear, pear]    [dog, dog, cow]    [0, 8, 5.5]
2             [apple, banana]         [dog, dog]         [7, 4]
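To see what `map(list, zip(*row))` does before pandas gets involved, here is a small sketch on a single row. It uses only built-ins; the variable names are illustrative.

```python
row = [("apple", "dog", 5), ("banana", "cat", 32.3), ("pineapple", "horse", 33)]

# zip(*row) passes each tuple as a separate argument, so zip pairs up the
# first elements, then the second elements, then the third elements.
transposed = list(zip(*row))

# map(list, ...) turns each of those position-wise tuples into a list,
# which becomes one cell of the resulting DataFrame row.
cells = list(map(list, transposed))

print(cells)
# [['apple', 'banana', 'pineapple'], ['dog', 'cat', 'horse'], [5, 32.3, 33]]
```

Because the conversion to lists happens while the rows are being built, no second pass over the finished DataFrame is needed.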

Performance of Python map + list vs. applymap, measured with perfplot:

[Figure: perfplot timings]

import numpy as np
import pandas as pd
import perfplot

np.random.seed(5)

in_zero = ['apple', 'banana', 'pear']
in_one = ['cat', 'dog', 'cow', 'horse']


def gen_data(n):
    in_lst = []
    for _ in range(n):
        in_lst.append(np.array([
            np.random.choice(in_zero, 3),
            np.random.choice(in_one, 3),
            np.random.random(3) * 33,
        ]).transpose().tolist())
    return in_lst


def pure_python(data):
    return pd.DataFrame([map(list, zip(*row)) for row in data],
                        columns=["first", "second", "third"])


def apply_map(data):
    return pd.DataFrame([zip(*row) for row in data],
                        columns=["first", "second", "third"]).applymap(list)


if __name__ == '__main__':
    out = perfplot.bench(
        setup=gen_data,
        kernels=[
            pure_python,
            apply_map
        ],
        labels=[
            'pure_python',
            'apply_map'
        ],
        n_range=[2 ** k for k in range(20)],
        equality_check=None
    )
    out.save('perfplot_results.png', transparent=False)
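The benchmark above runs with `equality_check=None`, so it does not confirm that both kernels return the same result. The sketch below is a lightweight alternative that checks equality and takes a rough timing with the standard-library `timeit` module instead of perfplot. The `sample` data and the `COLUMNS` constant are assumptions made for illustration, and the timings it prints will vary by machine.

```python
import timeit

import pandas as pd

COLUMNS = ["first", "second", "third"]


def pure_python(data):
    return pd.DataFrame([map(list, zip(*row)) for row in data], columns=COLUMNS)


def apply_map(data):
    df = pd.DataFrame([zip(*row) for row in data], columns=COLUMNS)
    # Assumption: DataFrame.map replaces applymap on pandas >= 2.1.
    elementwise = df.map if hasattr(df, "map") else df.applymap
    return elementwise(list)


# Small hypothetical sample: 2000 rows of uneven length.
sample = [
    [("apple", "dog", 5), ("banana", "cat", 32.3)],
    [("pear", "cow", 5.5)],
] * 1000

# Compare cell contents as nested Python lists.
same = pure_python(sample).values.tolist() == apply_map(sample).values.tolist()
print("same result:", same)

for fn in (pure_python, apply_map):
    seconds = timeit.timeit(lambda: fn(sample), number=5)
    print(f"{fn.__name__}: {seconds / 5:.4f} s per call")
```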

I would still use the pandas function explode:

s = df.explode('A')['A']
out = pd.DataFrame(s.tolist(),index=s.index).groupby(level=0).agg(list)
                            0                  1                  2
0  [apple, banana, pineapple]  [dog, cat, horse]  [5.0, 32.3, 33.0]
1         [apple, pear, pear]    [dog, dog, cow]    [0.0, 8.0, 5.5]
2             [apple, banana]         [dog, dog]         [7.0, 4.0]
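The explode approach can be followed step by step in the sketch below, which starts from the single-column DataFrame in the question. Passing `columns=` to name the result is an addition for illustration. Note that the third column comes back as floats, as in the output above, because the intermediate numeric column is stored as float64.

```python
import pandas as pd

data = [
    [("apple", "dog", 5), ("banana", "cat", 32.3), ("pineapple", "horse", 33)],
    [("apple", "dog", 0), ("pear", "dog", 8), ("pear", "cow", 5.5)],
    [("apple", "dog", 7), ("banana", "dog", 4)],
]
df = pd.DataFrame({"A": data})

# Step 1: one tuple per row; the original row number is repeated in the index.
s = df.explode("A")["A"]

# Step 2: split each tuple into three ordinary columns, keeping that index.
flat = pd.DataFrame(s.tolist(), index=s.index,
                    columns=["first", "second", "third"])

# Step 3: group by the original row number and collect each column into a list.
out = flat.groupby(level=0).agg(list)

print(out)
```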
