Looking for a faster way to create a new column in a DataFrame containing dictionary values from the rows of another column
I currently do what the title describes using .apply() and a lambda function:
calFilteredDf['startTime'] = calFilteredDf['start'].apply(lambda x: x['dateTime'])
This is very slow, and I would like to know how to get the same result in less time. calFilteredDf['start'] is a pandas Series, and the data in the 'start' column looks like this:
1 {'date': None, 'dateTime': '2021-08-11T15:00:0...
2 {'date': None, 'dateTime': '2021-08-12T09:30:0...
3 {'date': None, 'dateTime': '2021-08-12T10:00:0...
4 {'date': None, 'dateTime': '2021-08-18T11:00:0...
...
1692 {'date': None, 'dateTime': '2023-08-09T14:00:0...
1693 {'date': None, 'dateTime': '2023-08-09T15:00:0...
1694 {'date': None, 'dateTime': '2023-08-10T11:30:0...
1695 {'date': None, 'dateTime': '2023-08-10T16:00:0...
1696 {'date': None, 'dateTime': '2023-08-10T17:00:0...
Name: start, Length: 1697, dtype: object
The data in the new 'startTime' column needs to look like this:
1 2021-08-11T15:00:00-04:00
2 2021-08-12T09:30:00-04:00
3 2021-08-12T10:00:00-04:00
4 2021-08-18T11:00:00-04:00
...
1692 2023-08-09T14:00:00-04:00
1693 2023-08-09T15:00:00-04:00
1694 2023-08-10T11:30:00-04:00
1695 2023-08-10T16:00:00-04:00
1696 2023-08-10T17:00:00-04:00
Name: startTime, Length: 1697, dtype: object
Is there a quick way to do this? I tried setting
calFilteredDf['startTime'] = calFilteredDf['startTime']['dateTime']
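I also tried using .loc, but it did not work because the rows of 'start' are not the right data type. I then tried the swifter library to parallelize the work that .apply() does, but because the dataset is not very large it actually made things slower: the library takes extra steps to work out the best way to process the data.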
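To illustrate why direct indexing fails, here is a minimal sketch on a hypothetical three-row Series shaped like the 'start' column (the sample data and the variable name `start` are made up for illustration). Indexing a Series with `['dateTime']` looks for a row labelled 'dateTime' in the index; it does not reach into the dictionaries stored in each row.

```python
import pandas as pd

# Hypothetical sample shaped like the 'start' column in the question.
start = pd.Series(
    [
        {'date': None, 'dateTime': '2021-08-11T15:00:00-04:00'},
        {'date': None, 'dateTime': '2021-08-12T09:30:00-04:00'},
        {'date': None, 'dateTime': '2021-08-12T10:00:00-04:00'},
    ],
    index=[1, 2, 3],
    name='start',
)

# Series indexing is by row label, so this looks for a row called 'dateTime'.
try:
    start['dateTime']
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Each element is a plain Python dict, so the key has to be read row by row.
first_value = start.iloc[0]['dateTime']
print(lookup_failed, first_value)
```

This is why some form of per-row access (apply, transform, or a list comprehension) is needed to pull the value out of each dictionary.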
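pd.json_normalize is more convenient to use, but it turned out to be the slowest. A list comprehension turned out to be the fastest. Below is the pd.json_normalize code, followed by timing tests of the different approaches.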
import pandas as pd
aaa = [[{'date': None, 'dateTime': '2021-08-12T09:30'}],
[{'date': None, 'dateTime': '2021-08-12T10:00'}],
[{'date': None, 'dateTime': '2021-08-18T11:00'}],
[{'date': None, 'dateTime': '2023-08-09T14:00'}],
[{'date': None, 'dateTime': '2023-08-09T15:00'}],
[{'date': None, 'dateTime': '2023-08-10T11:30'}],
[{'date': None, 'dateTime': '2023-08-10T16:00'}],
[{'date': None, 'dateTime': '2023-08-10T17:00'}]]
calFilteredDf = pd.DataFrame(aaa)
print(calFilteredDf)
calFilteredDf = pd.json_normalize(calFilteredDf[0])
calFilteredDf['startTime'] = calFilteredDf['dateTime']
print(calFilteredDf)
Input
0
0 {'date': None, 'dateTime': '2021-08-12T09:30'}
1 {'date': None, 'dateTime': '2021-08-12T10:00'}
2 {'date': None, 'dateTime': '2021-08-18T11:00'}
3 {'date': None, 'dateTime': '2023-08-09T14:00'}
4 {'date': None, 'dateTime': '2023-08-09T15:00'}
5 {'date': None, 'dateTime': '2023-08-10T11:30'}
6 {'date': None, 'dateTime': '2023-08-10T16:00'}
7 {'date': None, 'dateTime': '2023-08-10T17:00'}
Output
date dateTime startTime
0 None 2021-08-12T09:30 2021-08-12T09:30
1 None 2021-08-12T10:00 2021-08-12T10:00
2 None 2021-08-18T11:00 2021-08-18T11:00
3 None 2023-08-09T14:00 2023-08-09T14:00
4 None 2023-08-09T15:00 2023-08-09T15:00
5 None 2023-08-10T11:30 2023-08-10T11:30
6 None 2023-08-10T16:00 2023-08-10T16:00
7 None 2023-08-10T17:00 2023-08-10T17:00
Yes, json_normalize is indeed about twice as slow. Below is the code that times apply, json_normalize, transform, and a list comprehension.
import datetime

# Assumes calFilteredDf is the original single-column frame, pd.DataFrame(aaa),
# not the json_normalize result from the previous snippet.
now = datetime.datetime.now()
for i in range(10000):
    calFilteredDf[0].apply(lambda x: x['dateTime'])
time_ = datetime.datetime.now() - now
print('apply', time_)

now = datetime.datetime.now()
for i in range(10000):
    pd.json_normalize(calFilteredDf[0])
time_ = datetime.datetime.now() - now
print('json_normalize', time_)

now = datetime.datetime.now()
for i in range(10000):
    calFilteredDf[0].transform(lambda x: x['dateTime'])
time_ = datetime.datetime.now() - now
print('transform', time_)

now = datetime.datetime.now()
for i in range(10000):
    # List comprehension over the dicts in the column
    a = [d['dateTime'] for d in calFilteredDf[0]]
time_ = datetime.datetime.now() - now
print('list generator', time_)
Output
apply 0:00:01.707580
json_normalize 0:00:03.666553
transform 0:00:01.896933
list generator 0:00:00.056657
As the output shows, the list comprehension (labelled 'list generator') is the fastest.
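As a minimal sketch of how the list comprehension result can be written back as the new column (the three-row frame below is hypothetical sample data, not the original calendar export): assigning a plain list to a column is positional, so it works even when the index does not start at 0, as in the question.

```python
import pandas as pd

# Hypothetical sample; the index starts at 1 to mimic the question's data.
calFilteredDf = pd.DataFrame(
    {
        'start': [
            {'date': None, 'dateTime': '2021-08-11T15:00:00-04:00'},
            {'date': None, 'dateTime': '2021-08-12T09:30:00-04:00'},
            {'date': None, 'dateTime': '2021-08-12T10:00:00-04:00'},
        ]
    },
    index=[1, 2, 3],
)

# Read the 'dateTime' key from each dict and assign the resulting list
# as a new column; list assignment is by position, not by index label.
calFilteredDf['startTime'] = [d['dateTime'] for d in calFilteredDf['start']]
print(calFilteredDf['startTime'])
```

If the list is wrapped in pd.Series first, pass index=calFilteredDf.index, because assigning a Series aligns on index labels rather than on position.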
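The apply, transform, and list comprehension approaches run at similar speeds on large datasets (more than 2,000 rows), and all of them are very fast. On smaller datasets (especially < 1,000 rows), the list comprehension outperforms the other approaches.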
Timings using the perfplot package:
import pandas as pd
import perfplot

def gen(n):
    # Build a Series of n identical dicts as benchmark input
    return pd.Series([{'date': None, 'dateTime': '2021-08-12T09:30'}] * n)

def using_apply(s):
    return s.apply(lambda x: x['dateTime'])

def using_transform(s):
    return s.transform(lambda x: x['dateTime'])

def using_list_comprehension(s):
    return pd.Series([i['dateTime'] for i in s])

perfplot.plot(
    setup=gen,
    kernels=[using_apply, using_transform, using_list_comprehension],
    n_range=[2**k for k in range(4, 22)],
    equality_check=None
)
Comparison with pd.json_normalize() on smaller datasets:
def using_json_normalize(s):
    return pd.json_normalize(s)['dateTime']

perfplot.plot(
    setup=gen,
    kernels=[using_apply, using_transform, using_list_comprehension, using_json_normalize],
    n_range=[2**k for k in range(4, 12)],
    equality_check=None
)
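Because both benchmarks pass equality_check=None, perfplot does not verify that the kernels return the same values. A small sketch of that check, run separately from the timing, might look like this (the function names mirror the kernels above; the sample size of 5 is arbitrary):

```python
import pandas as pd

def gen(n):
    return pd.Series([{'date': None, 'dateTime': '2021-08-12T09:30'}] * n)

def using_apply(s):
    return s.apply(lambda x: x['dateTime'])

def using_transform(s):
    return s.transform(lambda x: x['dateTime'])

def using_list_comprehension(s):
    return pd.Series([i['dateTime'] for i in s])

def using_json_normalize(s):
    return pd.json_normalize(s)['dateTime']

s = gen(5)

# Use the list comprehension as the reference and compare plain lists,
# so differences in Series name or index do not affect the check.
expected = using_list_comprehension(s).tolist()
results = {
    f.__name__: f(s).tolist()
    for f in (using_apply, using_transform, using_json_normalize)
}
all_match = all(values == expected for values in results.values())
print(all_match)
```

Comparing the values as lists keeps the check focused on the extracted strings rather than on Series metadata.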