Speeding up a Python loop with DataFrame / BigQuery

Question

This loop currently takes almost 3 hours on my desktop running at 5 GHz (OC). How would I go about speeding it up?

df = pd.DataFrame(columns=['clientId', 'url', 'count'])

idx = 0
for row in rows:
    df.loc[idx] = pd.Series({'clientId': row.clientId, 'url': row.pagePath, 'count': row.count})
    idx += 1

rows is JSON data stored in a (BigQuery) RowIterator.

<google.cloud.bigquery.table.RowIterator object at 0x000001ADD93E7B50>
<class 'google.cloud.bigquery.table.RowIterator'>

The JSON data looks like this:

Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/index.html', 45), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/contact.html', 65), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-au/index.html', 64), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-au/products.html', 56), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/employees.html', 54), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/contact/cookies.html', 44), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-au/careers.html', 91), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-ca/careers.html', 42), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/contact.html', 44), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/', 115), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/suppliers', 51), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/search.html', 60), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-au/careers.html', 50), {'clientId': 0, 'pagePath': 1, 'count': 2})

Answer 1

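This is not how a pandas DataFrame is meant to be used. A DataFrame represents data vertically: each column is a Series under the hood, backed by a fixed-size numpy array (although columns of the same data type have their arrays stored contiguously with one another).

Every time you append a new row to the DataFrame, every column's array is resized (i.e. reallocated), which is expensive in itself. You are doing this for every row, so you end up with n rounds of array reallocation for each group of columns with a distinct data type, which is very inefficient. On top of that, you create a pd.Series for each row, which incurs further allocations that serve no purpose when the DataFrame stores data vertically.

You can verify this by looking at the ids of the columns: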

>>> import pandas as pd
>>> df = pd.DataFrame(columns=['clientId', 'url', 'count'])

# Look at the ID of the DataFrame and the columns
>>> id(df)
1494628715776

# These are the IDs of the empty Series for each column
>>> id(df['clientId']), id(df['url']), id(df['count'])
(1494628789264, 1494630670400, 1494630670640)

# Assigning a series at an index that didn't exist before
>>> df.loc[0] = pd.Series({'clientId': 123, 'url': 123, 'count': 100})

# ID of the dataframe remains the same
>>> id(df)
1494628715776

# However, the underlying Series objects are different (newly allocated)
>>> id(df['clientId']), id(df['url']), id(df['count'])
(1494630712656, 1494630712176, 1494630712272)
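The fixed-size behaviour described above can also be seen directly in NumPy. The following is only a minimal sketch with made-up values, not a view into what pandas does internally: it shows that "appending" to an array produces a new, separately allocated array rather than growing the existing one.

```python
import numpy as np

# A NumPy array has a fixed size; np.append builds a new array and copies
# the old contents into it instead of growing the original in place.
original = np.array([1, 2, 3])
extended = np.append(original, 4)

# The original is unchanged, and the two arrays do not share memory.
shares_memory = np.shares_memory(original, extended)
print(original.size, extended.size, shares_memory)
```

Repeating such a copy once per row is what makes row-by-row growth expensive.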

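By adding new rows iteratively, you recreate new Series objects on every iteration, which is why it is slow. The pandas documentation also warns about this under the .append() method (the reasoning still holds even though the method is deprecated): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html#pandas.882790638195928884

"Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once."

You are better off iterating and appending into a data structure that is better suited to dynamic-size operations, such as a native Python list, before calling pd.DataFrame. For simple cases, however, you can pass a generator straight into the pd.DataFrame call: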

# No need to specify columns since you provided the dictionary with the keys
df = pd.DataFrame({'clientId': row.clientId, 'url': row.pagePath, 'count': row.count} for row in rows)
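For cases that are not quite as simple, the list-based approach mentioned above might look like the sketch below. FakeRow is a hypothetical stand-in for a BigQuery Row, and the count filter is an invented example of per-row logic; neither comes from the question.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class FakeRow:
    """Hypothetical stand-in for a BigQuery Row (attribute access only)."""
    clientId: str
    pagePath: str
    count: int


rows = [
    FakeRow("xxxxxxxxxx.xxxxxxxxxx", "/en-us/index.html", 45),
    FakeRow("xxxxxxxxxx.xxxxxxxxxx", "/", 115),
    FakeRow("xxxxxxxxxx.xxxxxxxxxx", "/suppliers", 51),
]

# Accumulate plain dicts in a Python list, which is cheap to grow.
records = []
for row in rows:
    if row.count < 50:  # invented per-row logic, for illustration only
        continue
    records.append({"clientId": row.clientId, "url": row.pagePath, "count": row.count})

# Build the DataFrame once, after the loop has finished.
df = pd.DataFrame(records, columns=["clientId", "url", "count"])
print(df)
```

The DataFrame is allocated a single time, so the cost of the loop stays proportional to the number of rows.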

Demonstrating the difference in a Jupyter notebook:

def reallocating_way(rows):
    df = pd.DataFrame(columns=['clientId', 'url', 'count'])
    for idx, row in enumerate(rows):
        df.loc[idx] = pd.Series({'clientId': row.clientId, 'url': row.pagePath, 'count': row.count})
    return df

def better_way(rows):
    return pd.DataFrame({'clientId': row.clientId, 'url': row.pagePath, 'count': row.count} for row in rows)

# Making an arbitrary list of 1000 rows
# (Row is a stand-in class exposing clientId, pagePath and count; its definition is not shown)
rows = [Row() for _ in range(1000)]

%timeit reallocating_way(rows)
%timeit better_way(rows)

2.45 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.8 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# Making an arbitrary list of 10000 rows
rows = [Row() for _ in range(10000)]

%timeit reallocating_way(rows)
%timeit better_way(rows)

27.3 s ± 1.88 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
12.4 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That is more than 1,000 times faster for 1,000 rows and more than 2,000 times faster for 10,000 rows.
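The benchmark above depends on a Row class that is not shown and on the %timeit notebook magic. A self-contained version that runs as a plain script could be sketched as follows, assuming a hypothetical FakeRow dataclass with placeholder values and a single time.perf_counter measurement instead of %timeit. Absolute timings will differ from the figures quoted above.

```python
import time
from dataclasses import dataclass

import pandas as pd


@dataclass
class FakeRow:
    """Hypothetical stand-in for a BigQuery Row, with placeholder values."""
    clientId: str = "xxxxxxxxxx.xxxxxxxxxx"
    pagePath: str = "/en-us/index.html"
    count: int = 45


def reallocating_way(rows):
    # Grows the DataFrame one row at a time (the slow pattern).
    df = pd.DataFrame(columns=["clientId", "url", "count"])
    for idx, row in enumerate(rows):
        df.loc[idx] = pd.Series(
            {"clientId": row.clientId, "url": row.pagePath, "count": row.count}
        )
    return df


def better_way(rows):
    # Builds the DataFrame in one call from a generator of dicts.
    return pd.DataFrame(
        {"clientId": row.clientId, "url": row.pagePath, "count": row.count}
        for row in rows
    )


def time_once(func, rows):
    # Single wall-clock measurement; much coarser than %timeit.
    start = time.perf_counter()
    result = func(rows)
    return result, time.perf_counter() - start


rows = [FakeRow() for _ in range(300)]
slow_df, slow_seconds = time_once(reallocating_way, rows)
fast_df, fast_seconds = time_once(better_way, rows)
print(f"row-by-row: {slow_seconds:.4f} s, single construction: {fast_seconds:.4f} s")
```

Both functions produce the same rows, so only the construction strategy is being compared.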

Answer 2

I came across the to_dataframe() method in BigQuery. It is extremely fast and cut the 3 hours down to 3 seconds.

df = query_job.result().to_dataframe()
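For context, the one-liner above might sit inside something like the following sketch. The function name, the SQL text and the table name are hypothetical, and the code assumes the google-cloud-bigquery package (with its pandas support) is installed and that credentials are configured. The import is placed inside the function only so that the sketch can be defined without the library present.

```python
def fetch_page_counts(sql):
    """Run `sql` in BigQuery and return the result as a pandas DataFrame."""
    # Imported lazily; requires google-cloud-bigquery and valid credentials.
    from google.cloud import bigquery

    client = bigquery.Client()      # picks up default credentials and project
    query_job = client.query(sql)   # starts the query job
    # result() waits for the job and returns a RowIterator;
    # to_dataframe() converts it in bulk instead of row by row.
    df = query_job.result().to_dataframe()
    # Match the column naming used in the question.
    return df.rename(columns={"pagePath": "url"})


# Hypothetical query; the project, dataset and table names are placeholders.
EXAMPLE_SQL = """
SELECT clientId, pagePath, `count`
FROM `my_project.my_dataset.page_views`
"""
```

Calling fetch_page_counts(EXAMPLE_SQL) would then replace the whole loop from the question, provided the placeholders are swapped for a real table.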

google.cloud.bigquery.table.RowIterator

Download BigQuery data to pandas using the BigQuery Storage API


2 answers

Answer 1
1 2022-04-15 16:54:45

Answer 2
0 Accepted 2022-04-15 13:31:02
