
Looking for a faster way to create a new column in a data frame containing dictionary values from the rows of another column

I currently do what the title describes using .apply() and a lambda function:

calFilteredDf['startTime'] = calFilteredDf['start'].apply(lambda x: x['dateTime'])

This is very slow, and I would like to know how to get the same result in less time. calFilteredDf['start'] is a pandas Series, and the data in the 'start' column looks like this:

1       {'date': None, 'dateTime': '2021-08-11T15:00:0...
2       {'date': None, 'dateTime': '2021-08-12T09:30:0...
3       {'date': None, 'dateTime': '2021-08-12T10:00:0...
4       {'date': None, 'dateTime': '2021-08-18T11:00:0...
                              ...                        
1692    {'date': None, 'dateTime': '2023-08-09T14:00:0...
1693    {'date': None, 'dateTime': '2023-08-09T15:00:0...
1694    {'date': None, 'dateTime': '2023-08-10T11:30:0...
1695    {'date': None, 'dateTime': '2023-08-10T16:00:0...
1696    {'date': None, 'dateTime': '2023-08-10T17:00:0...
Name: start, Length: 1697, dtype: object

The data in the new 'startTime' column needs to look like this:

1       2021-08-11T15:00:00-04:00
2       2021-08-12T09:30:00-04:00
3       2021-08-12T10:00:00-04:00
4       2021-08-18T11:00:00-04:00
                  ...            
1692    2023-08-09T14:00:00-04:00
1693    2023-08-09T15:00:00-04:00
1694    2023-08-10T11:30:00-04:00
1695    2023-08-10T16:00:00-04:00
1696    2023-08-10T17:00:00-04:00
Name: startTime, Length: 1697, dtype: object

Is there a faster way to do this? I tried setting

calFilteredDf['startTime'] = calFilteredDf['startTime']['dateTime']

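I also tried using .loc, but it did not work because the rows of 'start' are not the right data type. I then tried the swifter library to parallelize the work that .apply() is doing, but because the dataset is not very large, it actually made things slower: the library performs extra steps to decide on the best way to process the data.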

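pd.json_normalize is more convenient to use, but it turned out to be the slowest. A list comprehension turned out to be the fastest. Below is the pd.json_normalize code, followed by a timing test of the different methods.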

import pandas as pd

aaa = [[{'date': None, 'dateTime': '2021-08-12T09:30'}],
[{'date': None, 'dateTime': '2021-08-12T10:00'}],
[{'date': None, 'dateTime': '2021-08-18T11:00'}],
[{'date': None, 'dateTime': '2023-08-09T14:00'}],
[{'date': None, 'dateTime': '2023-08-09T15:00'}],
[{'date': None, 'dateTime': '2023-08-10T11:30'}],
[{'date': None, 'dateTime': '2023-08-10T16:00'}],
[{'date': None, 'dateTime': '2023-08-10T17:00'}]]


calFilteredDf = pd.DataFrame(aaa)

print(calFilteredDf)

calFilteredDf = pd.json_normalize(calFilteredDf[0])
calFilteredDf['startTime'] = calFilteredDf['dateTime']

print(calFilteredDf)

Input

                                                0
0  {'date': None, 'dateTime': '2021-08-12T09:30'}
1  {'date': None, 'dateTime': '2021-08-12T10:00'}
2  {'date': None, 'dateTime': '2021-08-18T11:00'}
3  {'date': None, 'dateTime': '2023-08-09T14:00'}
4  {'date': None, 'dateTime': '2023-08-09T15:00'}
5  {'date': None, 'dateTime': '2023-08-10T11:30'}
6  {'date': None, 'dateTime': '2023-08-10T16:00'}
7  {'date': None, 'dateTime': '2023-08-10T17:00'}

Output

   date          dateTime         startTime
0  None  2021-08-12T09:30  2021-08-12T09:30
1  None  2021-08-12T10:00  2021-08-12T10:00
2  None  2021-08-18T11:00  2021-08-18T11:00
3  None  2023-08-09T14:00  2023-08-09T14:00
4  None  2023-08-09T15:00  2023-08-09T15:00
5  None  2023-08-10T11:30  2023-08-10T11:30
6  None  2023-08-10T16:00  2023-08-10T16:00
7  None  2023-08-10T17:00  2023-08-10T17:00
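
One caveat when adapting this to the data in the question: the 'start' column there has an index that starts at 1, while pd.json_normalize called on a list builds a fresh 0-based index. The following is a minimal sketch of how the result could be realigned before assignment. The frame name `df` and the three sample rows are made up for illustration and are not from the original post.

```python
import pandas as pd

# Hypothetical sample mirroring the question: the index starts at 1, not 0.
df = pd.DataFrame(
    {
        "start": [
            {"date": None, "dateTime": "2021-08-11T15:00:00-04:00"},
            {"date": None, "dateTime": "2021-08-12T09:30:00-04:00"},
            {"date": None, "dateTime": "2021-08-12T10:00:00-04:00"},
        ]
    },
    index=[1, 2, 3],
)

# json_normalize on a plain list returns a frame with a 0-based RangeIndex.
flat = pd.json_normalize(df["start"].tolist())

# Reuse the original index so the column assignment lines up row by row.
flat.index = df.index
df["startTime"] = flat["dateTime"]

print(df["startTime"])
```

Without the `flat.index = df.index` line, pandas would align on index labels during assignment, and the rows would end up shifted against each other.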

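Yes, json_normalize is indeed about twice as slow. Below is the code that times apply, json_normalize, transform, and a list comprehension (labeled 'list generator' in the output).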

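import datetime

# calFilteredDf was overwritten by the json_normalize result above,
# so rebuild the original single-column frame before timing.
calFilteredDf = pd.DataFrame(aaa)

now = datetime.datetime.now()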
for i in range(10000):
    calFilteredDf[0].apply(lambda x: x['dateTime'])

time_ = datetime.datetime.now() - now
print('apply', time_)

now = datetime.datetime.now()
for i in range(10000):
    pd.json_normalize(calFilteredDf[0])

time_ = datetime.datetime.now() - now
print('json_normalize', time_)


now = datetime.datetime.now()
for i in range(10000):
    calFilteredDf[0].transform(lambda x: x['dateTime'])

time_ = datetime.datetime.now() - now
print('transform', time_)

now = datetime.datetime.now()
for i in range(10000):
    a = [i['dateTime'] for i in calFilteredDf[0]]

time_ = datetime.datetime.now() - now
print('list generator', time_)

Output

apply 0:00:01.707580
json_normalize 0:00:03.666553
transform 0:00:01.896933
list generator 0:00:00.056657

As the output shows, the list comprehension is the fastest.
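
Applied to the question's layout, the list comprehension could look roughly like the sketch below. The frame name `df` and the two sample rows are assumptions for illustration. The column names 'start' and 'startTime' come from the question.

```python
import pandas as pd

# Hypothetical two-row sample with the same shape as the question's data.
df = pd.DataFrame(
    {
        "start": [
            {"date": None, "dateTime": "2021-08-11T15:00:00-04:00"},
            {"date": None, "dateTime": "2021-08-12T09:30:00-04:00"},
        ]
    },
    index=[1, 2],
)

# Assigning a plain list is positional, so the non-default index is not a problem.
df["startTime"] = [d["dateTime"] for d in df["start"]]

print(df["startTime"])
```

If some dictionaries might lack the 'dateTime' key, `d.get("dateTime")` would yield None for those rows instead of raising a KeyError.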

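The apply, transform, and list comprehension approaches run at similar speeds on large datasets (more than 2,000 rows), and all of them are very fast. On smaller datasets (especially fewer than 1,000 rows), the list comprehension outperforms the others.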

Timings using the perfplot package:

import pandas as pd
import perfplot

def gen(n):
    # Build a Series of n identical dictionaries as test input.
    return pd.Series([{'date': None, 'dateTime': '2021-08-12T09:30'}] * n)

def using_apply(s):
    return s.apply(lambda x: x['dateTime'])

def using_transform(s):
    return s.transform(lambda x: x['dateTime'])

def using_list_comprehension(s):
    return pd.Series([i['dateTime'] for i in s])

perfplot.plot(
    setup=gen,
    kernels=[using_apply, using_transform, using_list_comprehension],
    n_range=[2**k for k in range(4, 22)],
    equality_check=None
)

Comparison with pd.json_normalize() on smaller datasets:

def using_json_normalize(s):
    return pd.json_normalize(s)['dateTime']

perfplot.plot(
    setup=gen,
    kernels=[using_apply, using_transform, using_list_comprehension, using_json_normalize],
    n_range=[2**k for k in range(4, 12)],
    equality_check=None
)
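
If perfplot is not installed, a rough version of the same comparison can be sketched with the standard library's timeit module. This is only an illustrative harness: the `kernels` dictionary, the `time_kernels` helper, and the chosen sizes and repeat counts are assumptions, not part of the original answer, and the absolute numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd


def gen(n):
    # Same test input as above: n identical dictionaries in a Series.
    return pd.Series([{"date": None, "dateTime": "2021-08-12T09:30"}] * n)


# Hypothetical registry of the four approaches being compared.
kernels = {
    "apply": lambda s: s.apply(lambda x: x["dateTime"]),
    "transform": lambda s: s.transform(lambda x: x["dateTime"]),
    "list comprehension": lambda s: pd.Series([i["dateTime"] for i in s]),
    "json_normalize": lambda s: pd.json_normalize(s.tolist())["dateTime"],
}


def time_kernels(n, repeat=3, number=10):
    """Return the best per-call time in seconds for each kernel on n rows."""
    s = gen(n)
    results = {}
    for name, func in kernels.items():
        # Take the best of `repeat` runs; each run calls the kernel `number` times.
        best = min(timeit.repeat(lambda: func(s), repeat=repeat, number=number))
        results[name] = best / number
    return results


for n in (16, 256, 2048):
    timings = time_kernels(n)
    print(n, {name: f"{t * 1e6:.1f} us" for name, t in timings.items()})
```

Taking the minimum over several repeats reduces noise from other processes, which matters most for the small inputs where the differences between the methods are largest.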
