Creating dataframe from a dictionary where entries have different lengths

Say I have a dictionary with 10 key-value pairs. Each entry holds a numpy array. However, the arrays do not all have the same length.

How can I create a dataframe where each column holds a different entry?

When I try:

pd.DataFrame(my_dict)

I get:

ValueError: arrays must all be the same length

Is there any way to overcome this? I am happy to have Pandas use NaN to pad the columns for the shorter entries.
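For reference, the failure can be reproduced with a minimal sketch like the one below. The two-entry dictionary is a stand-in for the 10-entry one described above, and the exact wording of the error message may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# stand-in for the real dictionary: two arrays of different lengths
my_dict = {'A': np.array([1, 2]), 'B': np.array([1, 2, 3, 4])}

try:
    pd.DataFrame(my_dict)
    raised = False
except ValueError as err:
    # pandas refuses to build columns of unequal length from raw arrays
    raised = True
    print(f'ValueError: {err}')
```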

In Python 3.x:

import pandas as pd
import numpy as np

d = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )
    
pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in d.items() ]))

Out[7]: 
    A  B
0   1  1
1   2  2
2 NaN  3
3 NaN  4

In Python 2.x:

Replace d.items() with d.iteritems().

Here's a simple way to do that:

In[20]: my_dict = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )
In[21]: df = pd.DataFrame.from_dict(my_dict, orient='index')
In[22]: df
Out[22]: 
   0  1   2   3
A  1  2 NaN NaN
B  1  2   3   4
In[23]: df.transpose()
Out[23]: 
    A  B
0   1  1
1   2  2
2 NaN  3
3 NaN  4

A way of tidying up the syntax, while still doing essentially the same thing as the other answers, is below:

>>> mydict = {'one': [1,2,3], 2: [4,5,6,7], 3: 8}

>>> dict_df = pd.DataFrame({ key:pd.Series(value) for key, value in mydict.items() })

>>> dict_df

   one  2    3
0  1.0  4  8.0
1  2.0  5  NaN
2  3.0  6  NaN
3  NaN  7  NaN

A similar syntax exists for lists, too:

>>> mylist = [ [1,2,3], [4,5], 6 ]

>>> list_df = pd.DataFrame([ pd.Series(value) for value in mylist ])

>>> list_df

     0    1    2
0  1.0  2.0  3.0
1  4.0  5.0  NaN
2  6.0  NaN  NaN

Another syntax for lists is:

>>> mylist = [ [1,2,3], [4,5], 6 ]

>>> list_df = pd.DataFrame({ i:pd.Series(value) for i, value in enumerate(mylist) })

>>> list_df

   0    1    2
0  1  4.0  6.0
1  2  5.0  NaN
2  3  NaN  NaN

You may additionally have to transpose the result and/or change the column data types (float, integer, etc.).
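As a minimal sketch of the data-type point, assuming a pandas version that provides the nullable "Int64" extension dtype: the NaN padding turns the shorter columns into floats, and converting afterwards restores integer values while keeping the padding as missing entries. The name int_df is only illustrative.

```python
import pandas as pd

mydict = {'one': [1, 2, 3], 2: [4, 5, 6, 7], 3: 8}
dict_df = pd.DataFrame({key: pd.Series(value) for key, value in mydict.items()})

# the NaN padding forces the shorter columns to float64
print(dict_df.dtypes)

# the nullable "Int64" dtype stores integers and marks the padding as <NA>
int_df = dict_df.astype('Int64')
print(int_df)
```

Whether this conversion is wanted depends on what happens next with the data; plain float64 with NaN is often fine for numeric work.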

While this does not directly answer the OP's question, I found this to be an excellent solution for my case when I had unequal arrays, and I'd like to share it:

From the pandas documentation (the snippet assumes Series and DataFrame were imported from pandas):

In [31]: d = {'one' : Series([1., 2., 3.], index=['a', 'b', 'c']),
   ....:      'two' : Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
   ....: 

In [32]: df = DataFrame(d)

In [33]: df
Out[33]: 
   one  two
a    1    1
b    2    2
c    3    3
d  NaN    4

You can also use pd.concat along axis=1 with a list of pd.Series objects:

import pandas as pd, numpy as np

d = {'A': np.array([1,2]), 'B': np.array([1,2,3,4])}

res = pd.concat([pd.Series(v, name=k) for k, v in d.items()], axis=1)

print(res)

     A  B
0  1.0  1
1  2.0  2
2  NaN  3
3  NaN  4

Both of the following lines work:

pd.DataFrame.from_dict(my_dict, orient='index').transpose() #A

pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in my_dict.items() ])) #B (Better)

However, with %timeit in Jupyter, B ran about 4x faster than A for me, which matters when working with a huge data set (mainly one with a large number of columns/features).
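To check the timing on your own data outside Jupyter, a sketch along these lines could be used. The test dictionary (200 columns of increasing length), the function names, and the repeat count are all assumptions made for illustration; the ratio you measure will depend on your pandas version and on the shape of the data.

```python
import timeit

import numpy as np
import pandas as pd

# hypothetical test data: 200 columns with lengths from 1 to 200
my_dict = {f'col{i}': np.arange(i + 1) for i in range(200)}

def approach_a():
    # build row-wise, then transpose
    return pd.DataFrame.from_dict(my_dict, orient='index').transpose()

def approach_b():
    # wrap each array in a Series so pandas aligns on the index
    return pd.DataFrame({k: pd.Series(v) for k, v in my_dict.items()})

time_a = timeit.timeit(approach_a, number=5)
time_b = timeit.timeit(approach_b, number=5)
print(f'A: {time_a:.3f}s  B: {time_b:.3f}s')
```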

Use pandas.DataFrame and pandas.concat

  • The following code uses a list comprehension to create a list of DataFrames with pandas.DataFrame from a dict of uneven arrays, and then joins them with pandas.concat.
    • This is a way to create a DataFrame from arrays that are not equal in length.
    • For equal-length 1-D arrays, use df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
import pandas as pd
import numpy as np


# create the uneven arrays
mu, sigma = 200, 25
np.random.seed(365)
x1 = mu + sigma * np.random.randn(10, 1)
x2 = mu + sigma * np.random.randn(15, 1)
x3 = mu + sigma * np.random.randn(20, 1)

data = {'x1': x1, 'x2': x2, 'x3': x3}

# create the dataframe
df = pd.concat([pd.DataFrame(v, columns=[k]) for k, v in data.items()], axis=1)

Use pandas.DataFrame and itertools.zip_longest

  • For iterables of uneven length, zip_longest fills missing values with the fillvalue.
  • The zip_longest iterator needs to be unpacked into a list, because the DataFrame constructor won't unpack it.
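from itertools import zip_longest

# flatten each (n, 1) array, then zip the values together; short columns are padded with None
zl = list(zip_longest(*(v.ravel() for v in data.values())))

# create dataframe; the None padding shows up as NaN
df = pd.DataFrame(zl, columns=list(data))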
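A smaller sketch of the same idea, using plain lists and an explicit fillvalue, may make the padding easier to see. The dictionary d and the name rows are illustrative only.

```python
from itertools import zip_longest

import numpy as np
import pandas as pd

d = {'A': [1, 2], 'B': [1, 2, 3, 4]}

# each tuple is one row; the shorter list is padded with NaN
rows = list(zip_longest(*d.values(), fillvalue=np.nan))

small_df = pd.DataFrame(rows, columns=list(d))
print(small_df)
```

Passing fillvalue=np.nan makes the padding explicit; with the default fillvalue of None, pandas still ends up showing NaN in numeric columns.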

plot

df.plot(marker='o', figsize=[10, 5])

dataframe

           x1         x2         x3
0   232.06900  235.92577  173.19476
1   176.94349  209.26802  186.09590
2   194.18474  168.36006  194.36712
3   196.55705  238.79899  218.33316
4   249.25695  167.91326  191.62559
5   215.25377  214.85430  230.95119
6   232.68784  240.30358  196.72593
7   212.43409  201.15896  187.96484
8   188.97014  187.59007  164.78436
9   196.82937  252.67682  196.47132
10        NaN  223.32571  208.43823
11        NaN  209.50658  209.83761
12        NaN  215.27461  249.06087
13        NaN  210.52486  158.65781
14        NaN  193.53504  199.10456
15        NaN        NaN  186.19700
16        NaN        NaN  223.02479
17        NaN        NaN  185.68525
18        NaN        NaN  213.41414
19        NaN        NaN  271.75376

If you don't want it to show NaN and you have two particular lengths, adding a 'space' in each remaining cell would also work.

import pandas as pd

long = [6, 4, 7, 3]
short = [5, 6]

# pad the shorter list with a space for each missing cell
for n in range(len(long) - len(short)):
    short.append(' ')

df = pd.DataFrame({'A': long, 'B': short})
# write the result to an Excel file (requires the xlsxwriter package)
with pd.ExcelWriter('example1.xlsx', engine='xlsxwriter') as datatoexcel:
    df.to_excel(datatoexcel, sheet_name='Sheet1')

   A  B
0  6  5
1  4  6
2  7   
3  3   

If you have more than 2 lengths of entries, it is advisable to make a function which uses a similar method.
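One possible shape for such a function is sketched below. The name pad_columns and its fill parameter are hypothetical, not part of pandas; the function pads every list to the length of the longest one before building the DataFrame.

```python
import pandas as pd

def pad_columns(columns, fill=' '):
    """Build a DataFrame from a dict of lists, padding short lists with `fill`."""
    longest = max(len(values) for values in columns.values())
    padded = {
        name: list(values) + [fill] * (longest - len(values))
        for name, values in columns.items()
    }
    return pd.DataFrame(padded)

padded_df = pad_columns({'A': [6, 4, 7, 3], 'B': [5, 6], 'C': [9]})
print(padded_df)
```

Note that padding with a string turns the padded columns into object dtype, so this is better suited to display or export than to further numeric work.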

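Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please credit this site's URL or the original address. For any questions, contact yoyou2525@163.com.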