How do I skip header rows when reading files from Google Cloud Storage in a Dataflow job?

I'm working on a data processing pipeline where we read a lot of files from Cloud Storage. Some of the files are CSV files with a header row, which I need to remove so that it doesn't cause errors further down the pipeline.

If possible I would love to use:

TextIO.Read.from(filePattern)

together with something else since it automatically handles compression and such. Ideally it should look something like this:

TextIO.Read.from(filePattern, numberOfHeaderRows)

and that should just exclude the first numberOfHeaderRows rows of each file. What is the easiest way to achieve something like this in Java?

The simplest path is probably to use TextIO.Read.from(filePattern) and then use a ParDo to filter out the header rows.
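Note that a ParDo sees each line on its own, with no information about where the line sat in its file, so the filter has to recognize a header by its content rather than by counting rows. Below is a minimal sketch of that per-element check in plain Java, so it can be run without the SDK. The class name HeaderFilter and the header text "id,name,amount" are hypothetical; inside a real DoFn you would output the element only when isHeader returns false. This assumes every file uses the same, known header line, and that no data row is identical to it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Content-based header check: the per-element logic a ParDo's DoFn could call. */
final class HeaderFilter {
    // Hypothetical: the exact header text expected at the top of each CSV file.
    private final String headerLine;

    HeaderFilter(String headerLine) {
        this.headerLine = headerLine.trim();
    }

    /** True when the line equals the known header, ignoring surrounding whitespace. */
    boolean isHeader(String line) {
        return line != null && line.trim().equals(headerLine);
    }

    /** Local stand-in for the ParDo step: keeps every line that is not a header. */
    List<String> filter(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (!isHeader(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        HeaderFilter filter = new HeaderFilter("id,name,amount");
        // Lines from two files, interleaved the way a pipeline might deliver them.
        List<String> lines = Arrays.asList(
                "id,name,amount", "1,alice,10", "id,name,amount", "2,bob,20");
        System.out.println(filter.filter(lines));
    }
}
```

If the header text differs between files, or a data row could legitimately equal the header, this content-based approach is not enough and the header would need to be identified some other way.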
