
Java: read excel file stored in a bucket using apache beam pipeline

I know the way below, but I need it in an Apache Beam pipeline. Please provide an example:

try (ReadableByteChannel chan = FileSystems.open(FileSystems.matchNewResource(
        "gs://bucketname/filename.xlsx", false))) {
    InputStream inputStream = Channels.newInputStream(chan);
    // ... read from inputStream
}

I have implemented reading a .xlsx file from the local file system, but the same should work for your GCS bucket path. I have tried the same in a different pipeline, and it worked fine.

The enrichedCollection in the code below can be treated like a .csv file being read line by line. I have used semicolons as the delimiter to separate the values.
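For instance, a downstream step could split each emitted line back into fields. A minimal plain-Java sketch (the sample row is taken from the console output further down; field positions follow the sample file's header):

```java
// Each element of enrichedCollection is one semicolon-delimited line
// produced by the DoFn below.
String line = "1.0;Dulce;Abril;Female;United States;32.0;15/10/2017;1562.0";

// limit -1 keeps trailing empty fields instead of silently dropping them
String[] fields = line.split(";", -1);

String firstName = fields[1];               // second column of the sample sheet
double age = Double.parseDouble(fields[5]); // numeric cells were written as doubles
```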

package com.fooBar;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.CellType;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.util.Iterator;

public class SampleExcelInput {

    public static void main(String[] args) throws IOException {
        
        Pipeline pipeline = Pipeline.create();
        
        PCollection<FileIO.ReadableFile> inputCollection  = pipeline.apply(FileIO.match()
//                .filepattern("gs://bucket/file.xlsx"))
                        .filepattern("C:\\Workspace\\ApacheBeam\\src\\main\\resources\\Inputfiles\\SampleExcel.xlsx"))
                .apply(FileIO.readMatches());
        
        PCollection<String> enrichedCollection = inputCollection.apply(ParDo.of(new ReadXlsxDoFn()));
        //TODO: do further processing treating the lines of enrichedCollection pcollection as if they were read from csv
        pipeline.run().waitUntilFinish();
    }

    static class ReadXlsxDoFn extends DoFn<FileIO.ReadableFile, String> {
        static final String DELIMITER = ";";

        @ProcessElement
        public void process(ProcessContext c) throws IOException {
            FileIO.ReadableFile file = c.element();
            System.out.println("FileName being read is :" + file);
            assert file != null;
            // try-with-resources so the stream and workbook are closed even on failure
            try (InputStream stream = Channels.newInputStream(file.openSeekable());
                 XSSFWorkbook wb = new XSSFWorkbook(stream)) {
                XSSFSheet sheet = wb.getSheetAt(0);     // first sheet of the workbook
                // iterating over the rows of the Excel sheet
                for (Row row : sheet) {
                    Iterator<Cell> cellIterator = row.cellIterator();   // iterating over each column
                    StringBuilder sb = new StringBuilder();
                    while (cellIterator.hasNext()) {
                        Cell cell = cellIterator.next();
                        if (cell.getCellType() == CellType.NUMERIC) {
                            sb.append(cell.getNumericCellValue()).append(DELIMITER);
                        } else {
                            sb.append(cell.getStringCellValue()).append(DELIMITER);
                        }
                    }
                    if (sb.length() > 0) {
                        String line = sb.substring(0, sb.length() - 1); // drop the trailing delimiter
                        System.out.println(line);
                        c.output(line);
                    }
                }
            }
        }
    }
}

For the dependencies I had to manually add some jars to make it work; you can take that reference from here.

Apart from the above jars, I have the following Maven dependencies.

        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-core</artifactId>
            <version>2.37.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-runners-direct-java</artifactId>
            <version>2.37.0</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.11.0</version>
        </dependency>
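If the pipeline is to read from a gs:// path rather than the local file system, the GCS filesystem implementation also has to be on the classpath. A sketch of what that dependency would look like, assuming you stay on the same Beam version:

```xml
<!-- Assumed extra dependency for gs:// support; registers the GCS
     FileSystem so FileSystems/FileIO can resolve gs:// paths. -->
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
    <version>2.37.0</version>
</dependency>
```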

Link for the sample .xlsx file: here

Console output from the DoFn:

    FileName being read is :ReadableFile{metadata=Metadata{resourceId=C:\Users\USER\Desktop\Java Masterclass\ApacheBeam\src\main\resources\Inputfiles\SampleExcel.xlsx, sizeBytes=7360, isReadSeekEfficient=true, checksum=null, lastModifiedMillis=0}, compression=UNCOMPRESSED}
0.0;First Name;Last Name;Gender;Country;Age;Date;Id
1.0;Dulce;Abril;Female;United States;32.0;15/10/2017;1562.0
2.0;Mara;Hashimoto;Female;Great Britain;25.0;16/08/2016;1582.0
3.0;Philip;Gent;Male;France;36.0;21/05/2015;2587.0
4.0;Kathleen;Hanner;Female;United States;25.0;15/10/2017;3549.0
5.0;Nereida;Magwood;Female;United States;58.0;16/08/2016;2468.0
6.0;Gaston;Brumm;Male;United States;24.0;21/05/2015;2554.0
7.0;Etta;Hurn;Female;Great Britain;56.0;15/10/2017;3598.0
8.0;Earlean;Melgar;Female;United States;27.0;16/08/2016;2456.0  
.
.
.
50.0;Rasheeda;Alkire;Female;United States;29.0;16/08/2016;6125.0

Process finished with exit code 0

Note: Since the file is parsed line by line in a simple DoFn, there will be one thread per file. If you have just a single file of very large size, say ~5 GB, you will notice a significant performance drop. One workaround is to keep the input files small and use a wildcard file pattern.

