
How to use google-cloud-storage directly in an Apache Beam project

We are working on an Apache Beam project (version 2.4.0) where we also want to work with a bucket directly through the google-cloud-storage API. However, combining some of the Beam dependencies with google-cloud-storage leads to a dependency problem that is hard to solve.

We saw that Beam 2.4.0 depends on google-cloud-storage 1.22.0, which is why we use that version below; we had the same issues with 1.27.0. The following pom.xml specifies the four Beam dependencies we use in our project, of which the last two lead to issues.
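One way to see where the conflicting Google API artifacts come from is Maven's dependency tree. A sketch of the diagnostic commands we would run (the `-Dincludes` filters are illustrative; `com.google.api:gax` is the artifact that contains the `HeaderProvider` class from the error below, and `com.google.apis:google-api-services-storage` contains the `Storage$Objects$List` class):

```shell
# Show which dependency pulls in gax (home of com.google.api.gax.rpc.HeaderProvider)
mvn dependency:tree -Dincludes=com.google.api:gax

# Show which versions of the generated Storage API client are on the path
mvn dependency:tree -Dincludes=com.google.apis:google-api-services-storage
```

This does not fix anything by itself, but it shows which Beam artifact drags in which version, which is the starting point for any exclusion or pinning attempt.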

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.bol</groupId>
    <artifactId>beam-plus-storage</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <beam.version>2.4.0</beam.version>
    </properties>

    <dependencies>
        <!-- These first two dependencies do not clash -->
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-runners-direct-java</artifactId>
            <version>${beam.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-extensions-join-library</artifactId>
            <version>${beam.version}</version>
        </dependency>
        <!-- This one leads to java.lang.ClassNotFoundException: com.google.api.gax.rpc.HeaderProvider -->
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
            <version>${beam.version}</version>
        </dependency>
        <!-- This one leads to java.lang.NoSuchMethodError: com.google.api.services.storage.Storage$Objects$List.setUserProject(...) -->
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
            <version>${beam.version}</version>
        </dependency>

        <dependency>
            <groupId>com.google.cloud</groupId>
            <artifactId>google-cloud-storage</artifactId>
            <version>1.22.0</version>
        </dependency>

    </dependencies>
</project>

Below is a minimal working/broken usage of the storage API, listing files from a public bucket.

import com.google.api.gax.paging.Page;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class CloudStorageReader {

    public static void main(String[] args) {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        Page<Blob> list = storage.list(
                "gcp-public-data-landsat",
                Storage.BlobListOption.currentDirectory(),
                Storage.BlobListOption.prefix("LC08/PRE/044/034/LC80440342016259LGN00/"));
        for (Blob blob : list.getValues()) {
            System.out.println(blob);
        }
    }
}

When the last two dependencies are removed, listing the bucket's contents works fine. With the java-io Beam dependency, the HeaderProvider class is not found; with the Dataflow dependency, the setUserProject method is not found. See the comments in the pom for the full error messages.

We spent quite some time trying to fix the HeaderProvider error, which is the one that appears when all four Beam dependencies are imported. We added explicit dependencies for the clashing artifacts and added excludes on the Beam imports as well, but every time we added an explicit dependency, another related issue popped up. We attempted Maven shading, which is not that practical due to our project's packaging, so we never got it to work.
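For reference, the exclusion approach looked roughly like the sketch below. This is illustrative, not a working fix — the excluded artifact shown here is just one of several candidates, and in our experience each exclusion tended to surface the next conflict:

```xml
<!-- Sketch only: exclude Beam's transitive copy of the generated Storage
     client so the one pulled in by google-cloud-storage wins. Each such
     exclusion tended to surface another clashing artifact. -->
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
    <version>${beam.version}</version>
    <exclusions>
        <exclusion>
            <groupId>com.google.apis</groupId>
            <artifactId>google-api-services-storage</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```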

In the end, we resorted to creating a separate sub-module + jar for the cloud-storage interaction, introducing more complexity to our packaging/running.

As a final note, we had the same issue when trying to use the BigQuery API, but worked around it by re-using package-private Beam code.

It would be awesome if someone had a (relatively simple) way to get these libraries working together, or could confirm that this really is a challenging dependency issue that may need to be addressed in Apache Beam.

Instead of including a separate dependency for Cloud Storage, you can use Beam's built-in FileSystems API to list buckets, read/write files, and delete objects on Cloud Storage. Below is an example that lists all files under a bucket and then reads one of those files into a string.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

import com.google.common.io.ByteStreams;

import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.util.MimeTypes;

// Set the default pipeline options so the various filesystems are
// loaded into the registry. This shouldn't be necessary if used
// within a pipeline.
FileSystems.setDefaultPipelineOptions(PipelineOptionsFactory.create());

// List Bucket
MatchResult listResult = FileSystems.match("gs://filesystems-demo/**/*");
listResult
    .metadata()
    .forEach(
        metadata -> {
          ResourceId resourceId = metadata.resourceId();
          System.out.println(resourceId.toString());
        });


// Read file
ResourceId existingFileResourceId = FileSystems
    .matchSingleFileSpec("gs://filesystems-demo/test-file1.csv")
    .resourceId();

try (ByteArrayOutputStream out = new ByteArrayOutputStream();
    ReadableByteChannel readerChannel = FileSystems.open(existingFileResourceId);
    WritableByteChannel writerChannel = Channels.newChannel(out)) {
  ByteStreams.copy(readerChannel, writerChannel);

  System.out.println("File contents: \n" + out.toString());
}


// Write file
String contentToWrite = "Laces out Dan!";

ResourceId newFileResourceId = FileSystems
    .matchNewResource("gs://filesystems-demo/new-file.txt", false);

try (ByteArrayInputStream in = new ByteArrayInputStream(contentToWrite.getBytes());
    ReadableByteChannel readerChannel = Channels.newChannel(in);
    WritableByteChannel writerChannel = FileSystems.create(newFileResourceId, MimeTypes.TEXT)) {

  ByteStreams.copy(readerChannel, writerChannel);
}
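The same FileSystems API can also delete objects. A minimal sketch (the path is illustrative, and `java.util.Collections` needs to be imported alongside the classes above):

```java
// Delete file: FileSystems.delete takes a collection of ResourceIds,
// so wrap the single resource in a singleton list.
ResourceId fileToDelete = FileSystems
    .matchSingleFileSpec("gs://filesystems-demo/new-file.txt")
    .resourceId();

FileSystems.delete(Collections.singletonList(fileToDelete));
```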
