Speed up Python loop with DataFrame / BigQuery

This loop is currently taking almost 3 hours on my desktop running at 5 GHz (OC). How would I go about speeding it up?

import pandas as pd

df = pd.DataFrame(columns=['clientId', 'url', 'count'])

idx = 0
for row in rows:
    # Assigning to .loc one row at a time - this is the slow part
    df.loc[idx] = pd.Series({'clientId': row.clientId, 'url': row.pagePath, 'count': row.count})
    idx += 1

rows is JSON data stored in a (BigQuery) RowIterator.

<google.cloud.bigquery.table.RowIterator object at 0x000001ADD93E7B50>
<class 'google.cloud.bigquery.table.RowIterator'>

The JSON data looks like:

Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/index.html', 45), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/contact.html', 65), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-au/index.html', 64), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-au/products.html', 56), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/employees.html', 54), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/contact/cookies.html', 44), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-au/careers.html', 91), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-ca/careers.html', 42), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/contact.html', 44), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/', 115), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/suppliers', 51), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-us/search.html', 60), {'clientId': 0, 'pagePath': 1, 'count': 2})
Row(('xxxxxxxxxx.xxxxxxxxxx', '/en-au/careers.html', 50), {'clientId': 0, 'pagePath': 1, 'count': 2})

This is not how you use a pandas DataFrame. The DataFrame represents data vertically, meaning each column is a Series under the hood, backed by a fixed-size numpy array (although columns of the same data type may share a contiguous block).

Every time you append a new row to the DataFrame, every column's array is resized (i.e., reallocated), and that by itself is expensive. You are doing this for every row, meaning you get n rounds of array reallocation for each column of a unique data type, which is extremely inefficient. Furthermore, you are also creating a pd.Series for each row, which incurs yet more allocations that serve no purpose when the DataFrame stores data vertically.

You can verify this by looking at the id of the columns:

>>> import pandas as pd
>>> df = pd.DataFrame(columns=['clientId', 'url', 'count'])

# Look at the ID of the DataFrame and the columns
>>> id(df)
1494628715776

# These are the IDs of the empty Series for each column
>>> id(df['clientId']), id(df['url']), id(df['count'])
(1494628789264, 1494630670400, 1494630670640)

# Assigning a series at an index that didn't exist before
>>> df.loc[0] = pd.Series({'clientId': 123, 'url': 123, 'count': 100})

# ID of the dataframe remains the same
>>> id(df)
1494628715776

# However, the underlying Series objects are different (newly allocated)
>>> id(df['clientId']), id(df['url']), id(df['count'])
(1494630712656, 1494630712176, 1494630712272)

By iteratively adding new rows, you are re-creating new Series objects on every iteration, which is why it is slow. The pandas documentation also warns about this under the .append() method (the point still holds even though the method is deprecated): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html#pandas.DataFrame.append

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
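
As a concrete illustration of that advice, here is a minimal sketch that collects plain dicts in a Python list and builds the DataFrame once at the end (the attribute names clientId, pagePath and count come from the question; the rest is illustrative):

import pandas as pd

records = []
for row in rows:  # rows is the BigQuery RowIterator from the question
    records.append({'clientId': row.clientId,
                    'url': row.pagePath,
                    'count': row.count})

# One DataFrame construction for the whole table instead of one reallocation per row
df = pd.DataFrame(records, columns=['clientId', 'url', 'count'])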

You'd be better off doing the iterations and appending into a data structure better suited to dynamically-sized operations, such as the native Python list, before calling pd.DataFrame on it (as sketched above). For your simple case, however, you can just pass a generator into the pd.DataFrame call:

# No need to specify columns since you provided the dictionary with the keys
df = pd.DataFrame({'clientId': row.clientId, 'url': row.pagePath, 'count': row.count} for row in rows)

To demonstrate the difference in a Jupyter notebook:

def reallocating_way(rows):
    df = pd.DataFrame(columns=['clientId', 'url', 'count'])
    for idx, row in enumerate(rows):
        df.loc[idx] = pd.Series({'clientId': row.clientId, 'url': row.pagePath, 'count': row.count})
    return df

def better_way(rows):
    return pd.DataFrame({'clientId': row.clientId, 'url': row.pagePath, 'count': row.count} for row in rows)
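
# Note: the original answer does not show what Row() is. A minimal stand-in,
# assumed here purely so the benchmark below can be reproduced, might be:
import random

class Row:
    """Hypothetical stand-in exposing the same attributes as the BigQuery rows."""
    def __init__(self):
        self.clientId = 'xxxxxxxxxx.xxxxxxxxxx'
        self.pagePath = '/en-us/index.html'
        self.count = random.randint(1, 100)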

# Making an arbitrary list of 1000 rows
rows = [Row() for _ in range(1000)]

%timeit reallocating_way(rows)
%timeit better_way(rows)

2.45 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.8 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# Making an arbitrary list of 10000 rows
rows = [Row() for _ in range(10000)]

%timeit reallocating_way(rows)
%timeit better_way(rows)

27.3 s ± 1.88 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
12.4 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

More than 1000x faster for 1000 rows, and more than 2000x faster for 10000 rows.

I ran across the to_dataframe() method in BigQuery. Extremely fast. It took the job from 3 hours down to 3 seconds.

df = query_job.result().to_dataframe()
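
For context, a minimal end-to-end sketch (the SQL string and table name are placeholders, not from the question; only bigquery.Client(), query(), result() and to_dataframe() are the actual client calls):

from google.cloud import bigquery

client = bigquery.Client()  # assumes project/credentials are already configured

# Placeholder SQL - substitute the real dataset, table and fields
query_job = client.query(
    "SELECT clientId, pagePath, count FROM `my_project.my_dataset.pageviews`"
)

# Pull the entire result set into pandas in one call; if the optional
# google-cloud-bigquery-storage package is installed, the download goes through
# the BigQuery Storage API and is far faster than iterating the RowIterator.
df = query_job.result().to_dataframe()

See the BigQuery Storage API link below for details on that faster download path.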

google.cloud.bigquery.table.RowIterator

Downloading BigQuery data to pandas using the BigQuery Storage API
