How to get all file metadata from PCollection<String> in Beam
I have a flattened PCollection that contains file paths:
PCollection<String> "/this/is/a/123/*.csv , /this/is/a/124/*.csv"
flattenPCollection = pcs.apply(Flatten.<String>pCollections());
I want to read each file, get its file name, and process it:
flattenPCollection
    .apply("Read CSV files", FileIO.matchAll())
    .apply("Read matching files", FileIO.readMatches())
    .apply("Process each file", ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
        @ProcessElement
        public void process(@Element FileIO.ReadableFile file) {
            // We should be able to access the file and its metadata.
            logger.info("File Metadata resourceId is {} ", file.getMetadata().resourceId());
            // Here we read each line and process it.
        }
    }));
The following error occurs:
Caused by: java.io.FileNotFoundException: No files matched spec: bob,22,new york
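It looks like the pipeline is reading the first line of a CSV file and then looking for that string in the file system.
What causes this?
I want to get each file as a FileIO.ReadableFile.
I'm sure this is something very simple that I'm missing. Any help is appreciated.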
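A likely reading of the error, not confirmed in the thread: FileIO.matchAll() treats every element of its input PCollection as a file pattern, so if the flattened collection actually holds CSV rows instead of paths, each row gets looked up as a spec. The plain-JDK sketch below only imitates that per-element lookup with java.nio glob matching. It is not Beam code, and SpecMatchDemo, sampleDir, matchSpec, and the temp-directory layout are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SpecMatchDemo {
    /** Creates a temp directory holding two small CSV files (made-up sample data). */
    static Path sampleDir() {
        try {
            Path dir = Files.createTempDirectory("spec-demo");
            byte[] row = "bob,22,new york\n".getBytes(StandardCharsets.UTF_8);
            Files.write(dir.resolve("123.csv"), row);
            Files.write(dir.resolve("124.csv"), row);
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the sorted names of files in dir whose name matches the glob spec. */
    static List<String> matchSpec(Path dir, String spec) {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, spec)) {
            for (Path p : stream) {
                names.add(p.getFileName().toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        Path dir = sampleDir();
        // A real pattern resolves to the files in the directory.
        System.out.println(matchSpec(dir, "*.csv"));
        // A CSV row used as a spec resolves to nothing.
        System.out.println(matchSpec(dir, "bob,22,new york"));
    }
}
```

If this is indeed the cause, the thing to check would be that the collection passed to FileIO.matchAll() contains only patterns such as /this/is/a/123/*.csv, not lines read from those files.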
Update
If you have a collection of paths and files (here a plain Java list, pathList), you need to iterate over each one manually and add a ParDo:
for (String path : pathList) {
    pipeline.apply(FileIO.match().filepattern(path))
        .apply(FileIO.readMatches())
        .apply(
            ParDo.of(
                new DoFn<FileIO.ReadableFile, String>() {
                    @ProcessElement
                    public void process(@Element FileIO.ReadableFile file) throws IOException {
                        logger.info("Metadata - " + file.getMetadata());
                        logger.info("File Contents - " + file.readFullyAsUTF8String());
                        logger.info("File Metadata resourceId is " + file.getMetadata().resourceId());
                    }
                }));
}
Thanks to @bigbounty
Pipeline pipeline = Pipeline.create();
pipeline.apply(FileIO.match().filepattern("/Users/bigbounty/Documents/beam/files/*.txt"))
    .apply(FileIO.readMatches())
    .apply(
        ParDo.of(
            new DoFn<FileIO.ReadableFile, String>() {
                @ProcessElement
                public void process(@Element FileIO.ReadableFile file) throws IOException {
                    LOG.info("Metadata - " + file.getMetadata());
                    LOG.info("File Contents - " + file.readFullyAsUTF8String());
                    LOG.info("File Metadata resourceId is " + file.getMetadata().resourceId());
                }
            }));
PipelineResult pipelineResult = pipeline.run();
pipelineResult.waitUntilFinish();
Output:
Metadata - Metadata{resourceId=/Users/bigbounty/Document/beam/files/3.txt, sizeBytes=7, isReadSeekEfficient=true, lastModifiedMillis=0}
Metadata - Metadata{resourceId=/Users/bigbounty/Document/beam/files/1.txt, sizeBytes=7, isReadSeekEfficient=true, lastModifiedMillis=0}
Metadata - Metadata{resourceId=/Users/bigbounty/Document/beam/files/2.txt, sizeBytes=7, isReadSeekEfficient=true, lastModifiedMillis=0}
File Contents - hello-1
File Metadata resourceId is /Users/bigbounty/Document/beam/files/1.txt
File Contents - hello-2
File Metadata resourceId is /Users/bigbounty/Document/beam/files/2.txt
File Contents - hello-3
File Metadata resourceId is /Users/bigbounty/Document/beam/files/3.txt
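The answer logs whole-file contents, while the question also mentions processing each line together with the file name. One possible way to do that is to split the string returned by readFullyAsUTF8String() and tag each line with the name obtained from file.getMetadata().resourceId().getFilename(). The sketch below shows only the splitting step in plain Java so it can run without Beam. LineTagger and tagLines are hypothetical names, and the second sample row is made up.

```java
import java.util.ArrayList;
import java.util.List;

class LineTagger {
    /** Splits file contents into non-empty lines and prefixes each with the file name. */
    static List<String> tagLines(String fileName, String contents) {
        List<String> tagged = new ArrayList<>();
        // \R matches any line terminator (\n, \r\n, \r), so Windows-style files also work.
        for (String line : contents.split("\\R")) {
            if (!line.isEmpty()) {
                tagged.add(fileName + ": " + line);
            }
        }
        return tagged;
    }

    public static void main(String[] args) {
        // Sample input: the first row comes from the error message, the second is invented.
        String contents = "bob,22,new york\nalice,30,boston\n";
        for (String row : tagLines("1.csv", contents)) {
            System.out.println(row);
        }
    }
}
```

Inside the DoFn, each returned string could then be emitted to the output collection instead of being logged. Reading the whole file into memory this way is only reasonable for small files.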