I have tried to transfer a file from an Azure container to a GCS bucket, but ended up with the issues below.
Is there any way an Apache Beam pipeline can transfer the file itself, without dealing with the content of the file (so that the issues mentioned above won't happen)? I need to transfer multiple files from an Azure container to a GCS bucket.
Below is the code I am using to transfer the files at the moment:
String format = LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyy_MM_dd_HH_mm_ss"));
String connectionString = "<<AZURE_STORAGE_CONNECTION_STRING>>";
PipelineOptions options = PipelineOptionsFactory.create();
options.as(BlobstoreOptions.class).setAzureConnectionString(connectionString);
Pipeline p = Pipeline.create(options);
p.apply(TextIO.read().from("azfs://storageaccountname/containername/CSVSample.csv"))
 .apply("WriteCsv", FileIO.<String>write()
     .via(TextIO.sink())
     .to("azfs://storageaccountname/containername/" + format + "/")
     .withNumShards(1)
     .withSuffix(".csv"));
p.run().waitUntilFinish();
You should be able to use FileIO transforms for this purpose.
For example (untested pseudocode):
p.apply(FileIO.match().filepattern("azfs://storageaccountname/containername/CSVSample.csv"))
 .apply(FileIO.readMatches())
 .apply(ParDo.of(new MyWriteDoFn()));
MyWriteDoFn above would be a DoFn that reads bytes from a single matched file (using AzureBlobStoreFileSystem) and writes them to GCS (using GcsFileSystem). Instead of directly invoking methods of the underlying FileSystem implementations, you can use the static methods in the FileSystems class with the correct path prefix (azfs:// or gs://).