GCP dataproc cluster hadoop job to move data from gs bucket to s3 amazon bucket fails [CONSOLE]
First question on Stack Overflow, so please forgive any rookie mistakes.
I am currently moving a large amount of data (700+ GiB) from a folder in a GCS bucket to a folder in S3. The data consists of many small files, roughly 1-10 MB each.
I have made several attempts:

1. `gsutil -m rsync -r gs://<path> s3://<path>`: timed out because of the large volume of data.
2. `gsutil -m cp -r gs://<path> s3://<path>`: took far too long. Even with many parallel processes and/or threads, the average transfer speed was still only about 3.4 MiB/s. I made sure to upgrade the VM instance for this attempt.
3. `rclone`: the same performance problem as cp.

Recently I found another possible approach. I am not familiar with GCP, so please bear with me, and sorry in advance. Here is the reference I found: https://medium.com/swlh/transfer-data-from-gcs-to-s3-using-google-dataproc-with-airflow-aa49dc896dad The method involves creating a Dataproc cluster through the GCP console with the following configuration:
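For scale: at the observed throughput, the cp attempt above would have needed several days to finish, which is why I kept looking for a faster approach. A quick back-of-the-envelope check:

```python
# Rough transfer-time estimate for the gsutil cp attempt above.
total_gib = 700       # data volume (GiB)
speed_mib_s = 3.4     # observed average throughput (MiB/s)

total_mib = total_gib * 1024          # 1 GiB = 1024 MiB
seconds = total_mib / speed_mib_s
days = seconds / 86400

print(f"{seconds:,.0f} s = {days:.1f} days")  # → 210,824 s = 2.4 days
```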
Name:
<dataproc-cluster-name>
Region:
asia-southeast1
Nodes configuration:
1 main 2 worker @2vCPU & @3.75GBMemory & @30GBPersistentDisk
properties:
core fs.s3.awsAccessKeyId <key>
core fs.s3.awsSecretAccessKey <secret>
core fs.s3.impl org.apache.hadoop.fs.s3.S3FileSystem
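For reference, the console configuration above corresponds roughly to this gcloud command. This is only a sketch: the cluster name, `<key>`, and `<secret>` are placeholders from the question, and the machine type is my assumption based on the stated 2 vCPU node shape.

```shell
# Sketch of the cluster creation above via the gcloud CLI (assumed equivalents,
# not taken from the original post). Properties use gcloud's prefix:key=value form.
gcloud dataproc clusters create <dataproc-cluster-name> \
    --region=asia-southeast1 \
    --num-workers=2 \
    --master-machine-type=n1-standard-2 \
    --worker-machine-type=n1-standard-2 \
    --properties='core:fs.s3.awsAccessKeyId=<key>,core:fs.s3.awsSecretAccessKey=<secret>,core:fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem'
```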
Then I submitted a job through the console menu of the GCP website, using the jar `hadoop-mapreduce/hadoop-distcp.jar`. I could only find it, as `/usr/lib/hadoop/hadoop-distcp.jar`, by browsing the root filesystem of my main Dataproc cluster VM instance.
Start time:
31 Mar 2021, 16:00:25
Elapsed time:
3 sec
Status:
Failed
Region
asia-southeast1
Cluster
<cluster-name>
Job type
Hadoop
Main class or JAR
file://usr/lib/hadoop/hadoop-distcp.jar
Arguments
-update
gs://*
s3://*
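The console submission above can also be expressed as a gcloud command. This is a sketch only: the bucket paths are placeholders (the original used `gs://*` and `s3://*`), and arguments after `--` are passed straight through to DistCp.

```shell
# Sketch of the same Hadoop job submitted via the gcloud CLI.
# <cluster-name> and the bucket paths are placeholders, as in the question.
gcloud dataproc jobs submit hadoop \
    --cluster=<cluster-name> \
    --region=asia-southeast1 \
    --jar=file:///usr/lib/hadoop/hadoop-distcp.jar \
    -- -update 'gs://<source-bucket>/<path>' 's3://<target-bucket>/<path>'
```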
It returned this error:
/usr/lib/hadoop/libexec//hadoop-functions.sh: line 2400: HADOOP_COM.GOOGLE.CLOUD.HADOOP.SERVICES.AGENT.JOB.SHIM.HADOOPRUNJARSHIM_USER: invalid variable name
/usr/lib/hadoop/libexec//hadoop-functions.sh: line 2365: HADOOP_COM.GOOGLE.CLOUD.HADOOP.SERVICES.AGENT.JOB.SHIM.HADOOPRUNJARSHIM_USER: invalid variable name
/usr/lib/hadoop/libexec//hadoop-functions.sh: line 2460: HADOOP_COM.GOOGLE.CLOUD.HADOOP.SERVICES.AGENT.JOB.SHIM.HADOOPRUNJARSHIM_OPTS: invalid variable name
2021-03-31 09:00:28,549 ERROR tools.DistCp: Invalid arguments:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2638)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3342)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3374)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:126)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3425)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3393)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:486)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.hadoop.tools.DistCp.setTargetPathExists(DistCp.java:240)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:441)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunJarShim.main(HadoopRunJarShim.java:12)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2542)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2636)
... 18 more
Invalid arguments: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
usage: distcp OPTIONS [source_path...] <target_path>
OPTIONS
-append Reuse existing data in target files and
append new data to them if possible
-async Should distcp execution be blocking
-atomic Commit all changes or none
-bandwidth <arg> Specify bandwidth per map in MB, accepts
bandwidth as a fraction.
-blocksperchunk <arg> If set to a positive value, fileswith more
blocks than this value will be split into
chunks of <blocksperchunk> blocks to be
transferred in parallel, and reassembled on
the destination. By default,
<blocksperchunk> is 0 and the files will be
transmitted in their entirety without
splitting. This switch is only applicable
when the source file system implements
getBlockLocations method and the target
file system implements concat method
-copybuffersize <arg> Size of the copy buffer to use. By default
<copybuffersize> is 8192B.
-delete Delete from target, files missing in
source. Delete is applicable only with
update or overwrite options
-diff <arg> Use snapshot diff report to identify the
difference between source and target
-direct Write files directly to the target
location, avoiding temporary file rename.
-f <arg> List of files that need to be copied
-filelimit <arg> (Deprecated!) Limit number of files copied
to <= n
-filters <arg> The path to a file containing a list of
strings for paths to be excluded from the
copy.
-i Ignore failures during copy
-log <arg> Folder on DFS where distcp execution logs
are saved
-m <arg> Max number of concurrent maps to use for
copy
-numListstatusThreads <arg> Number of threads to use for building file
listing (max 40).
-overwrite Choose to overwrite target files
unconditionally, even if they exist.
-p <arg> preserve status (rbugpcaxt)(replication,
block-size, user, group, permission,
checksum-type, ACL, XATTR, timestamps). If
-p is specified with no <arg>, then
preserves replication, block size, user,
group, permission, checksum type and
timestamps. raw.* xattrs are preserved when
both the source and destination paths are
in the /.reserved/raw hierarchy (HDFS
only). raw.* xattrpreservation is
independent of the -p flag. Refer to the
DistCp documentation for more details.
-rdiff <arg> Use target snapshot diff report to identify
changes made on target
-sizelimit <arg> (Deprecated!) Limit number of files copied
to <= n bytes
-skipcrccheck Whether to skip CRC checks between source
and target paths.
-strategy <arg> Copy strategy to use. Default is dividing
work based on file sizes
-tmp <arg> Intermediate work path to be used for
atomic commit
-update Update target, copying only missing files
or directories
-v Log additional info (path, size) in the
SKIP/COPY log
-xtrack <arg> Save information about missing source files
to the specified directory
How can I fix this? The few fixes I found online were not very helpful. They either use the hadoop CLI or a jar file different from mine, for example this one about moving data using a dataproc hadoop cluster with airflow, and a related GitHub issue.
Disclaimer: I do not use the hadoop CLI or airflow. I do this through the console, and submitting the job through the Dataproc cluster's main VM instance shell returns the same error. Any detailed reference would be appreciated if needed. Thanks very much!
Update:
I think the Dataproc solution is overkill in your case. Dataproc would make sense if you needed to copy terabytes of data from GCS to S3 daily or hourly. But it sounds like yours is a one-off copy that is allowed to run for hours or days. I suggest running gsutil on a Google Cloud (GCP) instance. I tried an AWS EC2 instance for this, but for this particular operation it was always noticeably slower.
Create your source and destination buckets in the same region, for example us-east4 (Northern Virginia) for GCS and us-east-1 (Northern Virginia) for S3. Then deploy your instance in that same GCP region.
gsutil -m cp -r gs://* s3://*
. . . probably will not work. It certainly does not work in Dataproc; it always errors out unless I give an explicit file location or a bucket/folder path ending in /.
Instead, first try to copy one file explicitly and get that to succeed. Then try a whole folder or bucket.
How many files are you copying?
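A minimal sketch of that incremental approach, with hypothetical bucket, folder, and file names standing in for yours:

```shell
# Step 1: verify credentials and connectivity with one explicit object.
gsutil cp 'gs://<bucket>/<folder>/one-file.dat' 's3://<bucket>/<folder>/one-file.dat'

# Step 2: once that succeeds, copy the whole folder with parallel transfers.
gsutil -m cp -r 'gs://<bucket>/<folder>' 's3://<bucket>/<folder>'
```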