How to convert a list of lists of tuples into a pandas DataFrame as time-efficiently as possible?
I have a list of lists of tuples as input, like this:
[[("apple", "dog", 5), ("banana", "cat", 32.3), ("pineapple", "horse", 33)], [("apple", "dog", 0), ("pear", "dog", 8), ("pear", "cow", 5.5)], [("apple", "dog", 7), ("banana", "dog", 4)]]
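I want to convert it into a pandas DataFrame with one list-valued column per tuple position: one column holding the lists of all first elements, one holding the lists of all second elements, and so on.
I know I can create a pandas DataFrame with a single column whose values are lists of tuples, like this: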
df = pd.DataFrame({"A": [[("apple", "dog", 5), ("banana", "cat", 32.3), ("pineapple", "horse", 33)], [("apple", "dog", 0), ("pear", "dog", 8), ("pear", "cow", 5.5)], [("apple", "dog", 7), ("banana", "dog", 4)]]})
But I don't know how to convert it into a DataFrame with (in this example) 3 columns.
The desired output looks like this:
first second third
["apple", "banana", "pineapple"] ["dog", "cat", "horse"] [5, 32.3, 33]
["apple", "pear", "pear"] ["dog", "dog", "cow"] [0, 8, 5.5]
["apple", "banana"] ["dog", "dog"] [7, 4]
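How can I solve this as time-efficiently as possible? My real DataFrame will contain about 1M rows, so I want the fastest solution possible.
Use zip to format the data before putting it into the DataFrame.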
pd.DataFrame([zip(*row) for row in data], columns=["first", "second", "third"]).applymap(list)
first second third
0 [apple, banana, pineapple] [dog, cat, horse] [5, 32.3, 33]
1 [apple, pear, pear] [dog, dog, cow] [0, 8, 5.5]
2 [apple, banana] [dog, dog] [7, 4]
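In pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map. The following is a minimal sketch of the same zip approach that picks whichever method is available; the getattr fallback and the name elementwise are illustrative choices, not part of the original answer.

```python
import pandas as pd

data = [
    [("apple", "dog", 5), ("banana", "cat", 32.3), ("pineapple", "horse", 33)],
    [("apple", "dog", 0), ("pear", "dog", 8), ("pear", "cow", 5.5)],
    [("apple", "dog", 7), ("banana", "dog", 4)],
]

# zip(*row) regroups each row's tuples by position (all firsts, all seconds, ...)
df = pd.DataFrame([tuple(zip(*row)) for row in data],
                  columns=["first", "second", "third"])

# Assumption: DataFrame.map exists from pandas 2.1 on; older versions only have applymap
elementwise = getattr(df, "map", None) or df.applymap

# Convert every cell from a tuple to a list
df = elementwise(list)
print(df)
```

On older pandas versions the fallback simply calls applymap, so the result should match the output shown above.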
A pure Python option:
import pandas as pd

data = [
    [("apple", "dog", 5), ("banana", "cat", 32.3), ("pineapple", "horse", 33)],
    [("apple", "dog", 0), ("pear", "dog", 8), ("pear", "cow", 5.5)],
    [("apple", "dog", 7), ("banana", "dog", 4)]
]
# Transpose each row with zip, then turn every positional tuple into a list
df = pd.DataFrame([map(list, zip(*row)) for row in data],
                  columns=["first", "second", "third"])
print(df)
df:
first second third
0 [apple, banana, pineapple] [dog, cat, horse] [5, 32.3, 33]
1 [apple, pear, pear] [dog, dog, cow] [0, 8, 5.5]
2 [apple, banana] [dog, dog] [7, 4]
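To see what map(list, zip(*row)) does before pandas is involved, here is a small sketch on a single inner list from the question. The variable names are only illustrative.

```python
# One inner list from the question: three (fruit, animal, number) tuples
row = [("apple", "dog", 5), ("banana", "cat", 32.3), ("pineapple", "horse", 33)]

# zip(*row) passes the tuples as separate arguments and regroups them by position
columns_as_tuples = list(zip(*row))

# map(list, ...) converts each positional tuple into a list
columns_as_lists = list(map(list, zip(*row)))

print(columns_as_tuples)
print(columns_as_lists)
```

Each inner list therefore becomes one DataFrame row with three list-valued cells.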
Performance of Python map + list vs. applymap, measured via perfplot:
import numpy as np
import pandas as pd
import perfplot

np.random.seed(5)
in_zero = ['apple', 'banana', 'pear']
in_one = ['cat', 'dog', 'cow', 'horse']

def gen_data(n):
    # Build n rows, each a list of three [fruit, animal, number] entries
    in_lst = []
    for _ in range(n):
        in_lst.append(np.array([
            np.random.choice(in_zero, 3),
            np.random.choice(in_one, 3),
            np.random.random(3) * 33,
        ]).transpose().tolist())
    return in_lst

def pure_python(data):
    return pd.DataFrame([map(list, zip(*row)) for row in data],
                        columns=["first", "second", "third"])

def apply_map(data):
    return pd.DataFrame([zip(*row) for row in data],
                        columns=["first", "second", "third"]).applymap(list)

if __name__ == '__main__':
    out = perfplot.bench(
        setup=gen_data,
        kernels=[
            pure_python,
            apply_map
        ],
        labels=[
            'pure_python',
            'apply_map'
        ],
        n_range=[2 ** k for k in range(20)],
        equality_check=None
    )
    out.save('perfplot_results.png', transparent=False)
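If perfplot is not installed, a rough comparison can also be sketched with the standard-library timeit module. This is only an approximation of the benchmark above: make_rows, with_map_list and with_elementwise are hypothetical helper names, the inputs are materialized as lists and tuples rather than passed as lazy map/zip objects, and the row count is kept small so the script finishes quickly.

```python
import random
import timeit

import pandas as pd

COLUMNS = ["first", "second", "third"]

def make_rows(n, seed=5):
    # Hypothetical generator: n rows of three (fruit, animal, number) tuples
    rng = random.Random(seed)
    fruits = ["apple", "banana", "pear"]
    animals = ["cat", "dog", "cow", "horse"]
    return [[(rng.choice(fruits), rng.choice(animals), rng.random() * 33)
             for _ in range(3)] for _ in range(n)]

def with_map_list(data):
    # Build the lists in plain Python before constructing the DataFrame
    return pd.DataFrame([list(map(list, zip(*row))) for row in data],
                        columns=COLUMNS)

def with_elementwise(data):
    # Build tuples first, then convert cell by cell inside pandas
    df = pd.DataFrame([tuple(zip(*row)) for row in data], columns=COLUMNS)
    elementwise = getattr(df, "map", None) or df.applymap
    return elementwise(list)

rows = make_rows(1000)
for kernel in (with_map_list, with_elementwise):
    seconds = timeit.timeit(lambda: kernel(rows), number=3)
    print(f"{kernel.__name__}: {seconds:.4f} s for 3 runs")
```

Timings depend on the machine and pandas version, so the numbers are best read as relative rather than absolute.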
I would still use the pandas function explode:
s = df.explode('A')['A']
out = pd.DataFrame(s.tolist(),index=s.index).groupby(level=0).agg(list)
0 1 2
0 [apple, banana, pineapple] [dog, cat, horse] [5.0, 32.3, 33.0]
1 [apple, pear, pear] [dog, dog, cow] [0.0, 8.0, 5.5]
2 [apple, banana] [dog, dog] [7.0, 4.0]
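The explode result above has the default column labels 0, 1, 2. A minimal end-to-end sketch that starts from the single-column DataFrame in the question and assigns the desired names afterwards could look like this; the final renaming step is an addition, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"A": [
    [("apple", "dog", 5), ("banana", "cat", 32.3), ("pineapple", "horse", 33)],
    [("apple", "dog", 0), ("pear", "dog", 8), ("pear", "cow", 5.5)],
    [("apple", "dog", 7), ("banana", "dog", 4)],
]})

# One tuple per row; the original row index is repeated for each tuple
s = df.explode("A")["A"]

# Split the tuples into columns, then collect them back into lists per original row
out = pd.DataFrame(s.tolist(), index=s.index).groupby(level=0).agg(list)

# Added step: replace the default labels 0, 1, 2 with the desired column names
out.columns = ["first", "second", "third"]
print(out)
```

Note that the numeric column passes through a regular float column here, so integers such as 5 come back as 5.0, unlike in the zip-based approaches.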