Bigtable CellsColumnLimitFilter and ValueRangeFilter not working as intended

I'm new to Bigtable and I've been testing out the filtering features based on this documentation: https://cloud.google.com/bigtable/docs/using-filters

I've tried this in testcsvbigtablefilters.py in this repo, and I have run into some problems: https://github.com/limjix/GoogleCloudBigDataTest. For the record, I am testing against the Bigtable Emulator on my local machine.

I have issues with 2 filters:

  1. CellsColumnLimitFilter (lines 100-110, uncomment to try). This filter doesn't seem to be working. I have a table with 3 columns, but when I use CellsColumnLimitFilter(1) or CellsColumnLimitFilter(2), I still get all 3 columns.
     rows = table.read_rows(filter_=row_filters.CellsColumnLimitFilter(10))
     for row in rows:
         print_row(row)
  2. ValueRangeFilter (lines 172-183). This filter doesn't seem to work either. For example, my 1st column has values from around 10000 to 14000. When I set the range to 10000 to 14000, nothing comes up. When I set it to 0 to 3, the 10k to 14k values start showing up. This makes no sense, and the filter does not seem to work at all.
    rows = table.read_rows(
        filter_=row_filters.ValueRangeFilter(start_value=b'0',end_value=b'3'))

    for row in rows:
        print_row(row)

Also, how do I query for cells that have been overwritten? I know Bigtable keeps the mutations of a cell over a period of time. How do I query and filter for a specific time for that cell?

Any help would be appreciated, guys. There aren't many tutorials or much documentation anywhere else, so I hope the community can help.

Thank you!

Hey limjix, thanks for the questions. The Bigtable filtering documentation has only just been introduced, so questions like this can help us improve it moving forward.

  1. CellsColumnLimitFilter

The CellsColumnLimitFilter limits the number of cells in each column that are included in the output row. In the documentation it is listed as the cells per column filter, so I can see how this function name would be a bit confusing.

If your row only has a few columns, each with just one cell (one version) for its value, then CellsColumnLimitFilter will return all of them. If you want to receive only one column's data, you can use the CellsRowLimitFilter, which filters on cells per row. Or you could select specific columns with any of the column qualifier filters.
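
To make the difference concrete, here is a minimal pure-Python sketch that models the two limits on an in-memory row. It is not the Bigtable client: the row layout (a dict from column name to a newest-first list of cell values), the column names other than column1, and the helper names limit_cells_per_column and limit_cells_per_row are all assumptions made for illustration.

```python
from itertools import islice


def limit_cells_per_column(row, n):
    """Keep at most n cells (versions) in every column; no column is dropped."""
    return {column: cells[:n] for column, cells in row.items()}


def limit_cells_per_row(row, n):
    """Keep only the first n cells of the whole row, walking columns in order."""
    flat = ((column, cell) for column, cells in row.items() for cell in cells)
    kept = {}
    for column, cell in islice(flat, n):
        kept.setdefault(column, []).append(cell)
    return kept


# Hypothetical row shaped like the one in the question: 3 columns, 1 version each.
row = {
    "cf:column1": [b"10000"],
    "cf:column2": [b"a"],
    "cf:column3": [b"b"],
}

per_column = limit_cells_per_column(row, 1)
per_row = limit_cells_per_row(row, 1)

print(sorted(per_column))  # all three columns survive a per-column limit of 1
print(sorted(per_row))     # only the first column survives a per-row limit of 1
```

In this model a per-column limit only trims anything once a column holds more than one version, which matches the behaviour described in the question.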

  2. ValueRangeFilter

I did some digging and I believe I know what your issue is here, but I'm not 100% sure. I am happy to troubleshoot further with you if you need.

It looks like you set the cell values directly from the CSV:

bigtablerow.set_cell(column_family_id,
             "column1",
             #str(float(csvrow[20])+i),
             csvrow[20],
             timestamp=datetime.datetime.utcnow())

This works fine for strings, but if you set a cell value to a number, the Python client will treat it as an incrementable value and encode it as a 64-bit big-endian signed integer, which will not be comparable with the b'10000' that you have.
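
The encoding can be reproduced with the standard struct module, without touching Bigtable at all. The sketch below packs an arbitrary value from the 10000 to 14000 range and checks it against the same byte bounds used in the filter. The helper names pack_int64 and in_value_range are hypothetical, and the check assumes that values are compared byte by byte with both ends of the range inclusive.

```python
import struct


def pack_int64(value):
    """Pack an int as 8 bytes, big-endian, signed (the encoding described above)."""
    return struct.pack(">q", value)


def in_value_range(value, start, end):
    """Byte-wise (lexicographic) range check, inclusive on both ends (assumed)."""
    return start <= value <= end


as_int = pack_int64(13265)        # what an int cell value turns into
as_text = str(11510).encode()     # what a string cell value looks like

print(as_int)                                        # b'\x00\x00\x00\x00\x00\x003\xd1'
print(in_value_range(as_int, b"10000", b"14000"))    # False: starts with \x00
print(in_value_range(as_text, b"10000", b"14000"))   # True
```

Note that byte-wise comparison is not numeric comparison: b'9' sorts after b'14000', so string-encoded numbers only range-filter sensibly when they all have the same width.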

For example, this code won't return any rows:

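import random

from google.cloud import bigtable
from google.cloud.bigtable import row_filters

# project_id, instance_id and table_id are assumed to be defined elsewhere
client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
table = instance.table(table_id)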

rows = []
for i in range(10):
    row_key = 'test_num{}'.format(i).encode()
    row = table.direct_row(row_key)
    row.set_cell("cf".encode(),
                 "col".encode(),
                 random.randint(10000, 14000)
                 )
    rows.append(row)
table.mutate_rows(rows)


rows = table.read_rows(
    filter_=row_filters.ValueRangeFilter(start_value=b'10000',
                                         end_value=b'14000'))
for row in rows:
    print(row)

But when I turn the integer into a string, I get all the rows:

    row.set_cell("cf".encode(),
                 "col".encode(),
                 str(random.randint(10000, 14000))
                 )
    rows.append(row)

I would recommend using the cbt tool to check what the data looks like. For example, when I do cbt read, the first rows look like this:

test_num0
  cf:col                                   @ 2020/09/20-22:24:43.835000
    "\x00\x00\x00\x00\x00\x003\xd1"
----------------------------------------
test_num1
  cf:col                                   @ 2020/09/20-22:24:43.835000
    "\x00\x00\x00\x00\x00\x001t"

whereas the string ones look like this (they have two values because I actually reused the same row keys, but you should get the point):

test_num0
  cf:col                                   @ 2020/09/20-22:36:12.075000
    "11510"
  cf:col                                   @ 2020/09/20-22:24:43.835000
    "\x00\x00\x00\x00\x00\x003\xd1"
----------------------------------------
test_num1
  cf:col                                   @ 2020/09/20-22:36:12.075000
    "12048"
  cf:col                                   @ 2020/09/20-22:24:43.835000
    "\x00\x00\x00\x00\x00\x001t"

  3. Querying for cells that have been overwritten

This is similar to what you were trying to do in the first example. For cells that have been overwritten, you're looking for filters on cell timestamps (TimestampRangeFilter) or on the number of cells per column (CellsColumnLimitFilter again).
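
As a rough illustration of the timestamp approach, the pure-Python sketch below models the two versions of test_num0's cf:col cell from the cbt output. It is a model, not client code: the helper names in_timestamp_range and latest_versions and the cutoff time are hypothetical, and the range is assumed to be inclusive at the start and exclusive at the end.

```python
from datetime import datetime

# (timestamp, value) pairs for one column, newest first, copied from the cbt output.
cells = [
    (datetime(2020, 9, 20, 22, 36, 12, 75000), b"11510"),
    (datetime(2020, 9, 20, 22, 24, 43, 835000), b"\x00\x00\x00\x00\x00\x003\xd1"),
]


def in_timestamp_range(cells, start=None, end=None):
    """Keep cells with start <= timestamp < end; None leaves that side open."""
    return [
        (ts, value)
        for ts, value in cells
        if (start is None or ts >= start) and (end is None or ts < end)
    ]


def latest_versions(cells, n):
    """Keep the n newest cells; assumes the list is already newest first."""
    return cells[:n]


cutoff = datetime(2020, 9, 20, 22, 30)               # hypothetical point in time
overwritten = in_timestamp_range(cells, end=cutoff)  # only the older, overwritten cell
current = latest_versions(cells, 1)                  # only the newest cell
```

With the Python client, the equivalent filter would, as far as I know, be built by wrapping a row_filters.TimestampRange(start=..., end=...) in a row_filters.TimestampRangeFilter and passing it as filter_ to read_rows.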
