How to create a DataFrame from a dict of unequal-length lists, truncating to a specific length?
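I have a dict of lists (of variable length), and I am looking for an efficient way to create a DataFrame from it.
Assume I know the minimum list length, so I can truncate the longer lists to that size when creating the DataFrame.
Here is my dummy code: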
data_dict = {'a': [1,2,3,4], 'b': [1,2,3], 'c': [2,45,67,93,82,92]}
min_length = 3
The dict may have 10k or 20k keys, so I am looking for an efficient way to create a DataFrame like the one shown below:
>>> df
   a  b   c
0  1  1   2
1  2  2  45
2  3  3  67
You can slice each of the dict's values in a dict comprehension; the DataFrame constructor then works without any further changes:
print ({k:v[:min_length] for k,v in data_dict.items()})
{'b': [1, 2, 3], 'c': [2, 45, 67], 'a': [1, 2, 3]}
df = pd.DataFrame({k:v[:min_length] for k,v in data_dict.items()})
print (df)
   a  b   c
0  1  1   2
1  2  2  45
2  3  3  67
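The slicing approach above can be wrapped in a small helper. This is only a sketch: the function name truncated_frame and its explicit length check are assumptions added for illustration, not part of the original answer. The check matters because the plain DataFrame constructor raises an error when the lists end up with different lengths.

```python
import pandas as pd


def truncated_frame(data, length):
    """Build a DataFrame from the first `length` items of each list.

    Hypothetical helper: raises early if any list is too short, because
    pd.DataFrame needs all plain lists to have the same length.
    """
    too_short = [k for k, v in data.items() if len(v) < length]
    if too_short:
        raise ValueError(f"lists shorter than {length}: {too_short}")
    return pd.DataFrame({k: v[:length] for k, v in data.items()})


data_dict = {'a': [1, 2, 3, 4], 'b': [1, 2, 3], 'c': [2, 45, 67, 93, 82, 92]}
df = truncated_frame(data_dict, 3)
print(df)
```

If some lists may legitimately be shorter than the requested length, the Series-based variant is the one to use instead.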
If some lists may be shorter than min_length, wrap each slice in a Series so that the missing positions are filled with NaN:
data_dict = {'a': [1,2,3,4], 'b': [1,2], 'c': [2,45,67,93,82,92]}
min_length = 3
df = pd.DataFrame({k:pd.Series(v[:min_length]) for k,v in data_dict.items()})
print (df)
   a    b   c
0  1  1.0   2
1  2  2.0  45
2  3  NaN  67
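Note that column b becomes float in the output above, because NaN cannot be stored in a regular integer column. If integer values should be preserved, one possible variation (an assumption on my part, not something the original answer covers) is pandas' nullable "Int64" dtype, available in reasonably recent pandas versions:

```python
import pandas as pd

data_dict = {'a': [1, 2, 3, 4], 'b': [1, 2], 'c': [2, 45, 67, 93, 82, 92]}
min_length = 3

# Each truncated list becomes a nullable-integer Series; positions missing
# from a short list are filled with <NA> instead of NaN, so no float cast.
df = pd.DataFrame({
    k: pd.Series(v[:min_length], dtype="Int64")
    for k, v in data_dict.items()
})
print(df)
print(df.dtypes)
```

This will likely be somewhat slower than the plain Series version, so it is worth timing on real data before adopting it.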
Timings:
In [355]: %timeit (pd.DataFrame({k:v[:min_length] for k,v in data_dict.items()}))
The slowest run took 5.32 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 520 µs per loop
In [356]: %timeit (pd.DataFrame({k:pd.Series(v[:min_length]) for k,v in data_dict.items()}))
The slowest run took 4.50 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 937 µs per loop
#Allen's solution
In [357]: %timeit (pd.DataFrame.from_dict(data_dict,orient='index').T.dropna())
1 loop, best of 3: 16.7 s per loop
Setup code for the timings:
import numpy as np
import pandas as pd

np.random.seed(123)
L = list('ABCDEFGH')
N = 500000
min_length = 10000
data_dict = {k:np.random.randint(10, size=np.random.randint(N)) for k in L}
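The %timeit lines above only work inside IPython. A rough equivalent with the standard timeit module might look like the sketch below. The sizes are deliberately much smaller than in the original benchmark so that it finishes quickly, the function names by_slicing and by_series are made up for illustration, and the array sizes are drawn from [min_length, N) (a change from the setup above) so that every array is guaranteed to be long enough for the slicing approach.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
keys = list('ABCDEFGH')
N = 5000          # assumption: scaled down from the original 500000
min_length = 100  # assumption: scaled down from the original 10000

# Sizes come from [min_length, N), so no array is shorter than min_length.
data_dict = {
    k: np.random.randint(10, size=np.random.randint(min_length, N))
    for k in keys
}


def by_slicing():
    return pd.DataFrame({k: v[:min_length] for k, v in data_dict.items()})


def by_series():
    return pd.DataFrame({k: pd.Series(v[:min_length]) for k, v in data_dict.items()})


for func in (by_slicing, by_series):
    # Best of 3 repeats, 20 calls each, reported per call.
    seconds = min(timeit.repeat(func, number=20, repeat=3)) / 20
    print(f"{func.__name__}: {seconds * 1e6:.0f} us per call")
```

Absolute numbers will differ by machine and pandas version; only the relative ordering is meaningful.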
A one-line solution:
#Construct the df horizontally and then transpose. Finally drop rows with nan.
pd.DataFrame.from_dict(data_dict,orient='index').T.dropna()
Out[326]:
     a    b     c
0  1.0  1.0   2.0
1  2.0  2.0  45.0
2  3.0  3.0  67.0
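Two properties of the one-liner are worth spelling out. It truncates to the length of the shortest list rather than to an arbitrary min_length, and the NaN padding turns every value into a float, as the output above shows. The following sketch breaks the one-liner into steps and casts back to integers; the astype call is an addition for illustration and is not part of the original one-liner.

```python
import pandas as pd

data_dict = {'a': [1, 2, 3, 4], 'b': [1, 2, 3], 'c': [2, 45, 67, 93, 82, 92]}

# Rows are the dict keys, columns are list positions; shorter lists are
# padded with NaN on the right.
wide = pd.DataFrame.from_dict(data_dict, orient='index')

# Transpose so the keys become columns, drop every row that contains padding
# (this keeps exactly as many rows as the shortest list has items), then
# cast the float values back to integers.
df = wide.T.dropna().astype('int64')
print(df)
```

Because the intermediate frame is as wide as the longest list, this approach does far more work than slicing first, which is consistent with the 16.7 s timing reported above.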