Error while running python dataflow job:

Question

I am trying to split an input text file by the first character of each entry using GCP Dataflow (Python). If the first character of an entry is 'A', I want to store the entry in A.txt, and so on. I also have a number associated with each character. I have stored these in two hashmaps. The following is my code:

splitHashMap={'A':1,'F':4, 'J':4, 'Z':4, 'G':10, 'I':11};
fileHashMap= {'A':'A.txt','B':'B.txt','F':'F.txt','J':'J.txt','Z':'Z.txt','G':'G.txt','I':'I.txt'};
def to_table_row(x):
  firstChar=x[0][0];
  global splitHashMap
  global fileHashMap
  print splitHashMap[firstChar];
  x | WriteToText(fileHashMap[firstChar]);
  return {firstChar}

The error is with the WriteToText function and is as follows:

PTransform Create: Refusing to treat string as an iterable. (string=u'AIGLM0012016-02-180000000112016-02-18-12.00.00.123456GB CARMB00132') [while running 'ToTableRows']

Could someone please help me resolve this issue?

EDIT: The remainder of the code containing the pipeline is as follows:

parser = argparse.ArgumentParser()
parser.add_argument('--input',
                  dest='input',
                  default='gs://dataflow-samples/shakespeare/kinglear.txt',
                  help='Input file to process.')
parser.add_argument('--output',
                  dest='output',
                  help='Output file to write results to.')
known_args, pipeline_args = parser.parse_known_args(None)
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = True
p = beam.Pipeline(options=pipeline_options)


lines = p | 'read' >> ReadFromText(known_args.input)


lines | 'ToTableRows' >> beam.Map(to_table_row);

result = p.run()

I request you to help me resolve the issue now. The command I use to run the Python file is:

python File_parse.py --input temp.txt

Temp.txt is as follows:

Aasadasd asdasd adsad af
Jdsad asdasd asd as
A asdd ad agfsfg sfg 
Z afsdfrew320pjpoji
Idadfsd w8480ujfds

The desired output is that all the lines starting with 'A' go to "A.txt", those starting with 'B' go to "B.txt", and so on. It would be great if you wrote the code in your response.

Answer 1

Your use of WriteToText is not appropriate. You can't pass a string to a PTransform; you need to pass PCollections into PTransforms. Inside to_table_row, x is a single line of text (a string), which is why the error says it refuses to treat a string as an iterable. In the following code, you create a separate PCollection for each first character and pass each one to its own write transform.

What you can do in this case is something like this:

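import apache_beam as beam
from apache_beam import pvalue
from apache_beam.io import ReadFromText, WriteToText

file_hash_map = {'A':'A.txt','B':'B.txt','F':'F.txt',
                 'J':'J.txt','Z':'Z.txt','G':'G.txt','I':'I.txt'}
existing_chars = list(file_hash_map.keys())

class ToTableRowDoFn(beam.DoFn):
  def process(self, element):
    # element is one line of text; the slice is '' for an empty line.
    first_char = element[:1]
    if first_char in file_hash_map:
      # Route the line to the output tagged with its first character.
      yield pvalue.TaggedOutput(first_char, element)
    else:
      # When the first char of the line is not one of the allowed
      # characters, we just send it to the main output.
      yield element

# p and known_args are defined as in the question's pipeline setup.
lines = p | 'read' >> ReadFromText(known_args.input)

multiple_outputs = (
    lines |
    'ToTableRows' >> beam.ParDo(ToTableRowDoFn())
                           .with_outputs(*existing_chars, main='main'))

for pcollection_name in existing_chars:
  char_pcollection = getattr(multiple_outputs, pcollection_name)
  # Each transform needs a unique label, so include the character in it.
  # WriteToText treats its argument as a file path prefix for its shards.
  (char_pcollection
   | 'Write_%s' % pcollection_name >> WriteToText(file_hash_map[pcollection_name]))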

The crux of this code is the for loop, where we iterate over each of the output PCollections and write its contents to a separate file.
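To see which lines end up where without running a pipeline, the routing rule can be sketched in plain Python. This is only an illustration of the grouping logic, not Beam code: route_lines is a hypothetical helper, the 'main' bucket stands in for the main output, and the last sample line is an added assumption to show an unmatched first character.

```python
from collections import defaultdict

# Same mapping as in the answer above.
file_hash_map = {'A': 'A.txt', 'B': 'B.txt', 'F': 'F.txt',
                 'J': 'J.txt', 'Z': 'Z.txt', 'G': 'G.txt', 'I': 'I.txt'}


def route_lines(lines, file_map):
    """Group lines by target file; unknown first characters go to 'main'."""
    routed = defaultdict(list)
    for line in lines:
        if not line:
            continue  # an empty line has no first character to route on
        target = file_map.get(line[0], 'main')
        routed[target].append(line)
    return dict(routed)


# The sample input from the question, plus one line (assumed for
# illustration) whose first character is not in the mapping.
sample = [
    'Aasadasd asdasd adsad af',
    'Jdsad asdasd asd as',
    'A asdd ad agfsfg sfg ',
    'Z afsdfrew320pjpoji',
    'Idadfsd w8480ujfds',
    'Xunknown prefix',
]

routed = route_lines(sample, file_hash_map)
for target in sorted(routed):
    print(target, len(routed[target]))
```

In the Beam version, each key of this dictionary corresponds to one tagged output PCollection, and the loop over existing_chars attaches one write per tag.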
