
Apache-Beam + Python: Writing JSON (or dictionaries) strings to output file

I am trying to use a Beam pipeline to apply the SequenceMatcher function to a ton of words. I have (hopefully) figured everything out except the WriteToText part.

I have defined a custom ParDo (hereafter called ProcessDataDoFn) that takes the main_input and the side_input, processes them, and outputs dictionaries like this one:

{u'key': (u'string', float)}
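For context, here is one way such an element could be built with difflib.SequenceMatcher; the actual matching logic is not shown in the question, so the helper name and its arguments below are purely illustrative:

from difflib import SequenceMatcher

def best_match(key, candidates):
    # Illustrative only: compare `key` against each candidate word and keep
    # the closest one, producing the {key: (best_word, similarity)} shape above.
    best_word, best_score = None, 0.0
    for word in candidates:
        score = SequenceMatcher(None, key, word).ratio()
        if score > best_score:
            best_word, best_score = word, score
    return {key: (best_word, best_score)}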

My pipeline is quite simple:

class ProcessDataDoFn(beam.DoFn):
    def process(self, element, side_input):

        ... Series of operations ...

        return output_dictionary

with beam.Pipeline(options=options) as p:

    # Main input
    main_input = p | 'ReadMainInput' >> beam.io.Read(
        beam.io.BigQuerySource(
            query=CUSTOM_SQL,
            use_standard_sql=True
        ))

    # Side input
    side_input = p | 'ReadSideInput' >> beam.io.Read(
        beam.io.BigQuerySource(
            project=PROJECT_ID,
            dataset=DATASET,
            table=TABLE
        ))

    output = (
        main_input
        | 'ProcessData' >> beam.ParDo(
            ProcessDataDoFn(),
            side_input=beam.pvalue.AsList(side_input))
        | 'WriteOutput' >> beam.io.WriteToText(GCS_BUCKET)
    )

Now the problem is that if I leave the pipeline like this, it only outputs the keys of the output_dictionary. If I change the return of ProcessDataDoFn to json.dumps(output_dictionary), the JSON is written, but like this:

{
'
k
e
y
'

:

[
'
s
t
r
i
n
g
'

,

f
l
o
a
t
]

How can I correctly output the results?

I actually partially solved the issue.

The DoFn that I wrote returns either a dictionary or a JSON-formatted string. In both cases, the problem arises when Beam tries to do something with that return value. Beam seems to iterate over whatever process returns: if it is a dictionary, only its keys are emitted; if it is a string, every character is emitted separately (which is why the JSON output looks so strange). I found the solution to be rather simple: encapsulate either the dictionary or the string in a list. The JSON formatting part can either be done at the DoFn level or via a Transform like the one you showed.
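A minimal sketch of that fix, assuming the DoFn builds an output_dictionary as described (the placeholder value is made up; the point is only the wrapping in a list, or equivalently using yield, so that Beam emits one element instead of iterating over keys or characters):

import json
import apache_beam as beam

class ProcessDataDoFn(beam.DoFn):
    def process(self, element, side_input):
        # ... series of operations producing output_dictionary ...
        output_dictionary = {element: (u'match', 0.0)}  # placeholder for the real result
        # Wrapping the JSON string in a list makes Beam emit it as one element;
        # returning the bare string would emit one character at a time, and
        # returning the bare dict would emit only its keys.
        return [json.dumps(output_dictionary)]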

It's unusual that your output looks like that. json.dumps should print JSON on a single line, and it should go out to the files line by line.

Perhaps, to have cleaner code, you can add an extra Map operation that does your formatting however you need. Something like this:

output = (
  main_input
  | 'ProcessData' >> beam.ParDo(
        ProcessDataDoFn(),
        side_input=beam.pvalue.AsList(side_input))
  | 'FormatOutput' >> beam.Map(json.dumps)
  | 'WriteOutput' >> beam.io.WriteToText(GCS_BUCKET)
)
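With this arrangement, ProcessDataDoFn can keep emitting plain dictionaries; beam.Map(json.dumps) serializes each one, and WriteToText writes one JSON string per line (sharded across files under the GCS path). For an element shaped like the one above, each output line should look roughly like this (JSON turns the tuple into an array; the number is just an example):

{"key": ["string", 0.87]}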
