
Sqoop import directly to S3 bucket from Celery Airflow worker

My big data infrastructure has Airflow and EMR running as two separate clusters. Currently the data ETL steps are as follows (sketched as commands after the list):

  1. Sqoop data onto an Airflow worker (Hadoop 2.7 is installed there in pseudo-distributed mode)
  2. Sync the data to S3
  3. Access the data on S3 using Spark on EMR (EMR is running Hadoop 3.2.1)
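
For concreteness, the current flow looks roughly like this; the connection string, table, paths and bucket name are placeholders, and the sync step could equally be done with hadoop distcp:

    # Step 1: Sqoop the table into the worker's pseudo-distributed HDFS
    sqoop import \
      --connect jdbc:mysql://db-host:3306/sales \
      --username etl --password-file /user/airflow/.sqoop_pw \
      --table orders \
      --target-dir /staging/orders

    # Step 2: Pull the files down and sync them to S3
    hdfs dfs -get /staging/orders /tmp/orders
    aws s3 sync /tmp/orders s3://my-data-lake/raw/orders/

    # Step 3: Read the data from S3 with Spark on EMR, e.g.
    #   spark.read.csv("s3://my-data-lake/raw/orders/")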

In an attempt to streamline the ETL process, I feel that the second step is completely unnecessary, and that it should be possible to load data through Sqoop directly into S3 (the Sqoop command will be executed on the Airflow worker), with a single command along the lines of the one below.
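
Something like this (again, connection details and bucket name are placeholders):

    sqoop import \
      --connect jdbc:mysql://db-host:3306/sales \
      --username etl --password-file /user/airflow/.sqoop_pw \
      --table orders \
      --target-dir s3://my-data-lake/raw/orders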

But when I set the Sqoop --target-dir parameter to an S3 URL as above, the Sqoop job crashes with java.lang.RuntimeException: java.io.IOException: No FileSystem for scheme: s3. I have attempted many fixes to overcome this issue, but none have been successful so far. The things I have tried (sketched after the list) are:

  1. Pointing Sqoop at the Hadoop installation on EMR instead of the local pseudo-distributed Hadoop
  2. Copying possible dependency JARs from EMR into Sqoop's lib directory, such as emrfs-hadoop-assembly, hadoop-common and hadoop-hdfs
  3. Using different S3 URL schemes such as s3, s3a and s3n
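
Attempts 2 and 3 looked roughly like this (the EMR-side paths are approximate and vary by EMR release):

    # Attempt 2: copy candidate jars from the EMR master into the local Sqoop lib directory
    scp hadoop@emr-master:/usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-*.jar "$SQOOP_HOME/lib/"
    scp hadoop@emr-master:/usr/lib/hadoop/hadoop-common-*.jar "$SQOOP_HOME/lib/"
    scp hadoop@emr-master:/usr/lib/hadoop-hdfs/hadoop-hdfs-*.jar "$SQOOP_HOME/lib/"

    # Attempt 3: the same import with different URL schemes
    #   --target-dir s3://my-data-lake/raw/orders
    #   --target-dir s3a://my-data-lake/raw/orders
    #   --target-dir s3n://my-data-lake/raw/orders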

I'm confident that, to the best of my knowledge, I have configured everything properly. Is there something I have missed? Or is this a Sqoop limitation that doesn't allow loading directly to S3?

You can resolve it by following the steps here: https://aws.amazon.com/premiumsupport/knowledge-center/unknown-dataset-uri-pattern-sqoop-emr/
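
In case the link goes stale: the No FileSystem for scheme: s3 error means no FileSystem implementation is registered for that scheme on Sqoop's classpath. For a standalone Hadoop 2.7 + Sqoop install, the usual way to get an s3a:// target working is to put hadoop-aws and the AWS SDK jars it needs (shipped under share/hadoop/tools/lib) on the classpath and supply S3A credentials, roughly as below; the exact jar locations, the credential mechanism (static keys vs. an instance profile) and the bucket name are assumptions about your setup:

    # Make the S3A filesystem implementation and the AWS SDK visible to Sqoop
    export HADOOP_CLASSPATH="$HADOOP_HOME/share/hadoop/tools/lib/*:$HADOOP_CLASSPATH"
    # (copying those jars into $SQOOP_HOME/lib is another common approach)

    sqoop import \
      -D fs.s3a.access.key=YOUR_ACCESS_KEY \
      -D fs.s3a.secret.key=YOUR_SECRET_KEY \
      --connect jdbc:mysql://db-host:3306/sales \
      --username etl --password-file /user/airflow/.sqoop_pw \
      --table orders \
      --target-dir s3a://my-data-lake/raw/orders

Note that the generic -D options must come immediately after "import", and that s3a is the scheme to use with open-source Hadoop, while the bare s3 scheme is what EMR's EMRFS uses.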
