
Google Dataflow / Dataprep Shuffle key too large (INVALID_ARGUMENT)

I have tried running this job several times. Each run hits many quota-related warnings (and I have requested an increase each time), but in the end it always fails with the error message below, which I believe is caused by my dataset being too large, though I'm not sure. Dataprep is supposed to be able to handle ETL jobs of any scale, and this isn't even a particularly large job. Here is the error message; any help would be appreciated:

java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: java.io.IOException: INVALID_ARGUMENT: Shuffle key too large:2001941 > 1572864
at com.google.cloud.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:182)
at com.google.cloud.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:104)
at com.google.cloud.dataflow.worker.util.BatchGroupAlsoByWindowViaIteratorsFn.processElement(BatchGroupAlsoByWindowViaIteratorsFn.java:121)
at com.google.cloud.dataflow.worker.util.BatchGroupAlsoByWindowViaIteratorsFn.processElement(BatchGroupAlsoByWindowViaIteratorsFn.java:53)
at com.google.cloud.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:117)
...

Full error message can be found here: https://pastebin.com/raw/QTtmm5D2

I have gotten several quota increases, and while each one lets the job run farther than before, it still ends in the same error (although the reported shuffle key size is larger). It no longer appears to be blocked by a quota-related issue.

Any ideas short of ditching Dataprep and going back to map reduce?

This looks to me more like an error where a single value in a single column is too large, rather than the dataset as a whole being too large. Do you have columns with values this long? (The key in the error is about 2 MB.)
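One way to test this theory is to scan the source data for oversized values before rerunning the job. A minimal sketch, assuming the input is a local CSV export (the function name and file layout are illustrative, not part of Dataprep):

```python
import csv

def max_value_lengths(path):
    """Report the longest value (in UTF-8 bytes) seen in each column of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        longest = {}
        for row in reader:
            for col, val in row.items():
                size = len((val or "").encode("utf-8"))
                if size > longest.get(col, 0):
                    longest[col] = size
    return longest

# Any column whose longest value approaches the limit in the error above
# (1572864 bytes, i.e. 1.5 MiB) is a likely culprit for the shuffle failure.
```

If one column dominates, dropping or truncating it in the Dataprep recipe before any group-by step should let the job complete.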

That said, I think this should be reported to Dataprep as a bug. It appears that they group by column values, and they should probably trim those values to a smaller size before grouping. I don't know whether they monitor Stack Overflow.
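The trimming idea can be sketched as follows. This is not Dataprep's actual implementation, just an illustration of the workaround: replace any grouping key larger than some cap with a fixed-size digest, so grouping semantics are preserved while the shuffle key stays small (`MAX_KEY_BYTES` and `safe_key` are made-up names):

```python
import hashlib

# Hypothetical cap, kept well under Dataflow's ~1.5 MiB shuffle-key limit.
MAX_KEY_BYTES = 1024

def safe_key(value: str) -> str:
    """Replace an oversized grouping key with a fixed-size SHA-256 digest.

    Rows that shared the original key still share the digest, so a
    group-by on safe_key(value) produces the same groups (up to the
    negligible chance of a hash collision).
    """
    raw = value.encode("utf-8")
    if len(raw) <= MAX_KEY_BYTES:
        return value
    return "sha256:" + hashlib.sha256(raw).hexdigest()
```

In a hand-written Beam pipeline you could apply something like this in a `Map`/`DoFn` just before the `GroupByKey`; in Dataprep, where there is no code-level hook, the practical equivalent is a recipe step that truncates or drops the offending column before grouping.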

