How to properly test PCollection length when unit testing Apache Beam
I would like to know the best way to test the length of the output produced by a Beam pipeline.
I have some test code like this:
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

test_data = [
    {'kind': 'storage#object', 'name': 'file1.doc', 'contentType': 'application/octet-stream', 'bucket': 'bucket123'},
    {'kind': 'storage#object', 'name': 'file2.pdf', 'contentType': 'application/pdf', 'bucket': 'bucket234'},
    {'kind': 'storage#object', 'name': 'file3.msg', 'contentType': 'message/rfc822', 'bucket': 'bucket345'},
]

with TestPipeline() as p:
    output = (p
              | beam.Create(test_data)
              | beam.ParDo(DoFn_To_Test()).with_outputs('ok', 'error')
              )
I want to verify that every element of the test_data list goes to "output.ok". I think the way to do that is to count them, like this:
with TestPipeline() as p:
    output = (p
              | beam.Create(test_data)
              | beam.ParDo(DoFn_To_Test()).with_outputs('ok', 'error')
              )
    okay_count = (output.ok
                  | beam.Map(lambda x: ('dummy_key', x))
                  | beam.GroupByKey()  # This gets ('dummy_key', [element1, element2, ...])
                  | beam.Map(lambda x: len(x[1]))  # Drop the key and get the length of the list
                  )
    # And finally assert the count is correct:
    assert_that(okay_count, equal_to([len(test_data)]))
This works, but I don't think it is the best approach, and I'm sure there are other ways to do it.
This is the best option suggested so far: use beam.combiners.Count.Globally()
with TestPipeline() as p:
    output = (p
              | beam.Create(test_data)
              | beam.ParDo(DoFn_To_Test()).with_outputs('ok', 'error')
              )
    # Count only the elements routed to the 'ok' tagged output
    okay_count = output.ok | beam.combiners.Count.Globally()
    assert_that(okay_count, equal_to([len(test_data)]))
You answered your own question within the question itself. Posting it here as an answer:
with TestPipeline() as p:
    output = (p
              | beam.Create(test_data)
              | beam.ParDo(DoFn_To_Test()).with_outputs('ok', 'error')
              )
    # Count only the elements routed to the 'ok' tagged output
    okay_count = output.ok | beam.combiners.Count.Globally()
    assert_that(okay_count, equal_to([len(test_data)]))