Creating dataframe from a dictionary where entries have different lengths
Say I have a dictionary with 10 key-value pairs. Each entry holds a numpy array. However, the arrays do not all have the same length.
How can I create a dataframe where each column holds a different entry?
When I try:
pd.DataFrame(my_dict)
I get:
ValueError: arrays must all be the same length
Is there any way to overcome this? I am happy to have Pandas use NaN to pad the columns for the shorter entries.
In Python 3.x:
import pandas as pd
import numpy as np
d = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )
pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in d.items() ]))
Out[7]:
A B
0 1 1
1 2 2
2 NaN 3
3 NaN 4
In Python 2.x, replace d.items() with d.iteritems().
Here's a simple way to do that:
In[20]: my_dict = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )
In[21]: df = pd.DataFrame.from_dict(my_dict, orient='index')
In[22]: df
Out[22]:
0 1 2 3
A 1 2 NaN NaN
B 1 2 3 4
In[23]: df.transpose()
Out[23]:
A B
0 1 1
1 2 2
2 NaN 3
3 NaN 4
A way of tidying up the syntax, while still doing essentially the same thing as the other answers, is below:
>>> mydict = {'one': [1,2,3], 2: [4,5,6,7], 3: 8}
>>> dict_df = pd.DataFrame({ key:pd.Series(value) for key, value in mydict.items() })
>>> dict_df
one 2 3
0 1.0 4 8.0
1 2.0 5 NaN
2 3.0 6 NaN
3 NaN 7 NaN
A similar syntax exists for lists, too:
>>> mylist = [ [1,2,3], [4,5], 6 ]
>>> list_df = pd.DataFrame([ pd.Series(value) for value in mylist ])
>>> list_df
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 NaN
2 6.0 NaN NaN
Another syntax for lists is:
>>> mylist = [ [1,2,3], [4,5], 6 ]
>>> list_df = pd.DataFrame({ i:pd.Series(value) for i, value in enumerate(mylist) })
>>> list_df
0 1 2
0 1 4.0 6.0
1 2 5.0 NaN
2 3 NaN NaN
You may additionally have to transpose the result and/or change the column data types (float, integer, etc.).
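To illustrate the point about data types: padding with NaN forces an integer column to float64. The sketch below assumes pandas 0.24 or later, where the nullable `Int64` extension dtype is available, and converts the padded column back to whole numbers. The dictionary and variable names are made up for the example.

```python
import pandas as pd

mydict = {'a': [1, 2], 'b': [1, 2, 3, 4]}

# Wrapping each value in a Series lets pandas align on the index and pad with NaN.
df = pd.DataFrame({key: pd.Series(value) for key, value in mydict.items()})

# Column 'a' was padded, so it is float64; 'b' needed no padding and stays integer.
print(df.dtypes)

# The nullable integer dtype keeps whole numbers and marks missing cells as <NA>.
df_int = df.astype('Int64')
print(df_int)
```

Whether this conversion is worthwhile depends on what consumes the frame afterwards; some numeric code still expects plain float columns with NaN.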
While this does not directly answer the OP's question, I found it to be an excellent solution for my case, where I had unequal arrays, so I'd like to share it.
From the pandas documentation:
In [31]: d = {'one' : Series([1., 2., 3.], index=['a', 'b', 'c']),
....: 'two' : Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
....:
In [32]: df = DataFrame(d)
In [33]: df
Out[33]:
one two
a 1 1
b 2 2
c 3 3
d NaN 4
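Note that this example aligns the Series on their index labels, not on position. The following sketch (the names `s1`, `s2`, `by_label` and `by_position` are invented for illustration) shows the difference, and how `reset_index(drop=True)` gives positional padding when the Series carry unrelated labels.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=['x', 'y'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c'])

# Alignment is by label: no labels are shared, so every row has one NaN.
by_label = pd.DataFrame({'one': s1, 'two': s2})
print(by_label)

# Dropping the labels first aligns the values by position instead,
# so only the shorter column is padded at the end.
by_position = pd.DataFrame({
    'one': s1.reset_index(drop=True),
    'two': s2.reset_index(drop=True),
})
print(by_position)
```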
You can also use pd.concat along axis=1 with a list of pd.Series objects:
import pandas as pd, numpy as np
d = {'A': np.array([1,2]), 'B': np.array([1,2,3,4])}
res = pd.concat([pd.Series(v, name=k) for k, v in d.items()], axis=1)
print(res)
A B
0 1.0 1
1 2.0 2
2 NaN 3
3 NaN 4
Both of the following lines work perfectly:
pd.DataFrame.from_dict(df, orient='index').transpose() #A
pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in df.items() ])) #B (Better)
But with %timeit in Jupyter, I measured B to be about 4x faster than A, which is quite impressive, especially when working with a huge data set (mainly one with a large number of columns/features).
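The comparison is easy to repeat on your own data. The sketch below uses the standard-library `timeit` module rather than the `%timeit` magic so that it also runs outside Jupyter. The test dictionary (200 columns of varying length) is an arbitrary assumption, and the measured ratio will depend on the pandas version and the shape of the data.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 200 columns whose lengths vary from 1 to 50.
rng = np.random.default_rng(0)
data = {f'col{i}': rng.random(1 + i % 50) for i in range(200)}

def method_a():
    # Build rows from the dict, then transpose so the keys become columns.
    return pd.DataFrame.from_dict(data, orient='index').transpose()

def method_b():
    # Wrap each array in a Series so pandas pads the shorter ones with NaN.
    return pd.DataFrame({k: pd.Series(v) for k, v in data.items()})

time_a = timeit.timeit(method_a, number=5)
time_b = timeit.timeit(method_b, number=5)
print(f'A: {time_a:.4f} s, B: {time_b:.4f} s for 5 runs each')
```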
Use pandas.DataFrame and pandas.concat
The following code creates a list of DataFrames with pandas.DataFrame, from a dict of uneven arrays, and then concats them together, using a list comprehension.
This is a way to create a DataFrame from arrays that are not equal in length.
For equal-length arrays, use df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
import pandas as pd
import numpy as np
# create the uneven arrays
mu, sigma = 200, 25
np.random.seed(365)
x1 = mu + sigma * np.random.randn(10, 1)
x2 = mu + sigma * np.random.randn(15, 1)
x3 = mu + sigma * np.random.randn(20, 1)
data = {'x1': x1, 'x2': x2, 'x3': x3}
# create the dataframe
df = pd.concat([pd.DataFrame(v, columns=[k]) for k, v in data.items()], axis=1)
Use pandas.DataFrame and itertools.zip_longest
For iterables of uneven length, zip_longest fills missing values with the fillvalue.
The zip generator needs to be unpacked, because the DataFrame constructor won't unpack it.
from itertools import zip_longest
# zip all the values together
zl = list(zip_longest(*data.values()))
# create dataframe
df = pd.DataFrame(zl, columns=data.keys())
df.plot(marker='o', figsize=[10, 5])
x1 x2 x3
0 232.06900 235.92577 173.19476
1 176.94349 209.26802 186.09590
2 194.18474 168.36006 194.36712
3 196.55705 238.79899 218.33316
4 249.25695 167.91326 191.62559
5 215.25377 214.85430 230.95119
6 232.68784 240.30358 196.72593
7 212.43409 201.15896 187.96484
8 188.97014 187.59007 164.78436
9 196.82937 252.67682 196.47132
10 NaN 223.32571 208.43823
11 NaN 209.50658 209.83761
12 NaN 215.27461 249.06087
13 NaN 210.52486 158.65781
14 NaN 193.53504 199.10456
15 NaN NaN 186.19700
16 NaN NaN 223.02479
17 NaN NaN 185.68525
18 NaN NaN 213.41414
19 NaN NaN 271.75376
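As a smaller illustration of the same idea, the sketch below applies `zip_longest` to plain 1-D lists and passes `fillvalue` explicitly, so the padding value is visible in the code. The sample data is invented for the example.

```python
from itertools import zip_longest

import numpy as np
import pandas as pd

data = {'x1': [1, 2], 'x2': [3, 4, 5, 6]}

# zip_longest yields one tuple per row, padding exhausted inputs with fillvalue.
rows = list(zip_longest(*data.values(), fillvalue=np.nan))

# list(...) above materializes the iterator so the constructor sees the rows.
df = pd.DataFrame(rows, columns=list(data.keys()))
print(df)
```

If `fillvalue` is omitted, `zip_longest` pads with `None`, which pandas also treats as missing in numeric columns.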
If you don't want it to show NaN and you have two particular lengths, adding a 'space' in each remaining cell would also work.
import pandas as pd
long = [6, 4, 7, 3]
short = [5, 6]
# Pad the shorter list with spaces until both lists have the same length
for n in range(len(long) - len(short)):
    short.append(' ')
df = pd.DataFrame({'A': long, 'B': short})
# Write the result to example1.xlsx (requires the xlsxwriter package)
datatoexcel = pd.ExcelWriter('example1.xlsx', engine='xlsxwriter')
df.to_excel(datatoexcel, sheet_name='Sheet1')
# close() saves the file; ExcelWriter.save() was removed in pandas 2.0
datatoexcel.close()
A B
0 6 5
1 4 6
2 7
3 3
If you have more than two lengths of entries, it is advisable to write a function that uses a similar method.
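One possible shape for such a function is sketched below. The name `pad_columns` and its `fill` parameter are hypothetical, not part of pandas. It pads every list to the length of the longest one and leaves the input lists unmodified.

```python
import pandas as pd

def pad_columns(columns, fill=''):
    """Return a DataFrame whose shorter columns are padded with `fill`."""
    longest = max(len(values) for values in columns.values())
    padded = {
        name: list(values) + [fill] * (longest - len(values))
        for name, values in columns.items()
    }
    return pd.DataFrame(padded)

df = pad_columns({'A': [6, 4, 7, 3], 'B': [5, 6], 'C': [1]})
print(df)
```

Keep in mind that padding numeric columns with strings turns them into object columns, so this is mainly useful for display or export, not for further computation.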