How to efficiently read rows from Google BigTable into a pandas DataFrame

Question

Use case:

I am using Google BigTable to store counts like this:

|  rowkey  |    columnfamily    |
|          | col1 | col2 | col3 |
|----------|------|------|------|
| row1     | 1    | 2    | 3    |
| row2     | 2    | 4    | 8    |
| row3     | 3    | 3    | 3    |

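I want to read all rows for a given range of row keys (assume all of them in this case) and aggregate the values of each column.

A naive implementation would query the rows and iterate over them while aggregating the counts, like this: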

from google.cloud.bigtable import Client

instance = Client(project='project').instance('my-instance')
table = instance.table('mytable')

col1_sum = 0
col2_sum = 0
col3_max = 0

# read_rows() returns a streamed result; keep a reference and iterate over it
row_data = table.read_rows()

for row in row_data:
    # row.cells maps family name -> {column qualifier (bytes) -> [cells]}
    cells = row.cells['columnfamily']
    # [0] is the first cell of the column; .value holds its raw bytes
    col1_sum += int.from_bytes(cells[b'col1'][0].value, byteorder='big')
    col2_sum += int.from_bytes(cells[b'col2'][0].value, byteorder='big')
    col3_value = int.from_bytes(cells[b'col3'][0].value, byteorder='big')
    col3_max = max(col3_max, col3_value)

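Question:

Is there a way to efficiently load the resulting rows into a pandas DataFrame and leverage pandas' performance for the aggregation?

I would like to avoid using a for loop to compute the aggregates, since it is known to be very inefficient.

I am aware of the Apache Arrow project and its Python bindings. Although HBase is mentioned as a supporting project (and Google BigTable is advertised as being very similar to HBase), I can't seem to find a way to use it for the use case I described here.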

Answer 1

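After digging deeper into how BigTable works, it seems that the Python client performs a gRPC ReadRows call when you call table.read_rows(). That gRPC call returns a streamed response of rows in key order over HTTP/2 (see the docs).

If the API returns data row by row, it seems to me that the only useful way to consume that response is row-based. There seems to be little point in trying to load the data in a columnar format just to avoid looping over the rows.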
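A minimal sketch of that row-based consumption, feeding pandas afterwards: the rows are walked once, the raw bytes are collected per column, and each column is decoded in a single NumPy call before aggregating. The helper name rows_to_frame is hypothetical, and the sketch assumes that every row contains every requested column and that each value is an 8-byte big-endian signed integer; values of any other width would need a different decoder.

```python
import numpy as np
import pandas as pd


def rows_to_frame(rows, family, columns):
    """Build a DataFrame from an iterable of Bigtable-style row objects.

    Hypothetical helper. Assumes each row exposes `row_key` (bytes) and
    `cells` (family -> {qualifier bytes -> [cell, ...]}), that every
    requested column is present, and that each value is an 8-byte
    big-endian signed integer.
    """
    keys = []
    raw = {col: [] for col in columns}
    for row in rows:  # the response is streamed row by row, so one pass is needed
        keys.append(row.row_key.decode("utf-8"))
        family_cells = row.cells[family]
        for col in columns:
            # Take the first cell stored for this column.
            raw[col].append(family_cells[col.encode("utf-8")][0].value)
    # Decode each column's concatenated bytes in one vectorized call,
    # then convert to the native byte order.
    data = {
        col: np.frombuffer(b"".join(values), dtype=">i8").astype(np.int64)
        for col, values in raw.items()
    }
    return pd.DataFrame(data, index=pd.Index(keys, name="rowkey"))
```

With a real table this could be called as rows_to_frame(table.read_rows(), "columnfamily", ["col1", "col2", "col3"]), followed by df.agg({"col1": "sum", "col2": "sum", "col3": "max"}). The Python loop over rows remains; only the decoding and the aggregation are vectorized.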

Answer 2

I don't think there is an existing pandas interface for Cloud Bigtable, but it would be a nice project, similar to the BigQuery interface in https://github.com/pydata/pandas-gbq.

Answer 3

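You may be able to use pdhbase together with google-cloud-happybase. If that doesn't work out of the box, you may be able to take inspiration from how the integration is done.

There is also a Cloud Bigtable / BigQuery integration, which you could combine with https://github.com/pydata/pandas-gbq (thanks to Wes McKinney for the tip).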

Answer 4

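You can iterate over the BigTable rows and store them in a dictionary in order to turn them into a DataFrame. Below is an example that fetches a single row.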

import pandas as pd
from google.cloud import bigtable

# Placeholder identifiers; replace them with your own values.
project_id = "my-project"
instance_id = "my-instance"
table_id = "mytable"

client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
table = instance.table(table_id)
row_key = "1234"
row = table.read_row(row_key)

dct = {}
print("Reading data for {}:".format(row.row_key.decode("utf-8")))
for cf, cols in sorted(row.cells.items()):
    print("Column Family {}".format(cf))
    for col, cells in sorted(cols.items()):
        # Cells come back newest first, so keep only the latest version.
        dct[col.decode("utf-8")] = cells[0].value.decode("utf-8")

# One DataFrame row, with one column per column qualifier.
df = pd.DataFrame([dct])
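The single-row example can be extended to a range of row keys, which is what the question asks for. The following is only a sketch: read_range_to_frame is a hypothetical helper, the "family:qualifier" column naming is an arbitrary choice, and values are assumed to be UTF-8 text as in the example above. It relies on table.read_rows() accepting start_key and end_key and yielding one row object at a time.

```python
import pandas as pd


def read_range_to_frame(table, start_key=None, end_key=None):
    """Read a row-key range and return one DataFrame row per Bigtable row.

    Hypothetical helper; assumes cell values are UTF-8 encoded text.
    """
    records = []
    for row in table.read_rows(start_key=start_key, end_key=end_key):
        record = {"rowkey": row.row_key.decode("utf-8")}
        for family, cols in row.cells.items():
            for qualifier, cells in cols.items():
                # Keep only the first cell returned for each column.
                name = "{}:{}".format(family, qualifier.decode("utf-8"))
                record[name] = cells[0].value.decode("utf-8")
        records.append(record)
    if not records:
        # No rows in the range: return an empty frame instead of failing.
        return pd.DataFrame()
    return pd.DataFrame(records).set_index("rowkey")
```

Columns missing from some rows end up as NaN in the resulting frame, so numeric columns may need an explicit conversion (for example with pd.to_numeric) before aggregating.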

4 answers

Answer 1 (score 2, 2018-02-21 21:13:43)
Answer 2 (score 1, 2018-02-19 16:17:06)
Answer 3 (score 1, 2018-02-19 20:14:07)
Answer 4 (score 0, 2022-05-17 15:53:54)
