My big data infrastructure consists of Airflow and EMR running in two separate clusters. Currently the data ETL steps are as follows:

1. Sqoop imports the data from the source database into HDFS on the EMR cluster.
2. The imported data is then copied from HDFS to S3.
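Roughly, the two steps take this shape (the connection string, table, paths and the choice of s3-dist-cp as the copy tool are illustrative placeholders, not my exact commands):

```
# Step 1: Sqoop imports from the source database into HDFS on the EMR cluster
sqoop import \
  --connect jdbc:mysql://source-db:3306/sales \
  --table orders \
  --target-dir hdfs:///staging/orders

# Step 2: copy the imported files from HDFS to S3
# (s3-dist-cp is one common choice on EMR; hadoop distcp would play the same role)
s3-dist-cp --src hdfs:///staging/orders --dest s3://my-bucket/staging/orders
```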
In an attempt to streamline the ETL process, I feel that the second step is completely unnecessary and that it should be possible to load data through Sqoop directly into S3 (the sqoop command will be executed on the Airflow worker).
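In other words, the goal is a single command of roughly this shape (again, the connection details, table and bucket are hypothetical placeholders):

```
# Intended single step, run on the Airflow worker: Sqoop writes straight to S3
sqoop import \
  --connect jdbc:mysql://source-db:3306/sales \
  --table orders \
  --target-dir s3://my-bucket/staging/orders
```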
But when I set the Sqoop `--target-dir` parameter to an S3 URL, the Sqoop job crashes with `java.lang.RuntimeException: java.io.IOException: No FileSystem for scheme: s3`. I have attempted many fixes to overcome this issue, but none have been successful so far. Things I have tried are:
- Adding the `emrfs-hadoop-assembly`, `hadoop-common` and `hadoop-hdfs` JARs to the Sqoop classpath
- Using the `s3`, `s3a` and `s3n` URL schemes in `--target-dir`
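A sketch of the kind of wiring these attempts involve, assuming a stock Apache Hadoop layout on the Airflow worker (the classpath entry, S3A properties and database details here are illustrative, not my exact commands):

```
# Expose Hadoop's S3A connector (hadoop-aws plus the AWS SDK jars) to Sqoop;
# the jar versions under tools/lib must match the local Hadoop version
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*"

# Target S3 through the s3a scheme, passing credentials as S3A properties
# (generic -D options must precede the tool-specific Sqoop arguments)
sqoop import \
  -Dfs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
  -Dfs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" \
  --connect jdbc:mysql://source-db:3306/sales \
  --table orders \
  --target-dir s3a://my-bucket/staging/orders
```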
I'm confident that I have configured everything correctly to the best of my knowledge. Is there something I have missed? Or is it a Sqoop limitation that prevents loading directly into S3?
You can follow the steps here to resolve it: https://aws.amazon.com/premiumsupport/knowledge-center/unknown-dataset-uri-pattern-sqoop-emr/
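For reference, one pattern that sidesteps the missing `s3` filesystem on the Airflow worker entirely (not necessarily the exact steps in that article) is to submit the import as an EMR step, where EMRFS provides the `s3://` scheme natively. The cluster id, connection string, table and bucket below are hypothetical, and Sqoop must be installed on the cluster:

```
# Run the Sqoop import on the EMR cluster itself via command-runner.jar,
# so that EMRFS supplies the s3:// filesystem
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXX \
  --steps 'Type=CUSTOM_JAR,Name=SqoopToS3,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[sqoop,import,--connect,jdbc:mysql://source-db:3306/sales,--table,orders,--target-dir,s3://my-bucket/staging/orders]'
```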