
Sqoop import directly to S3 bucket from Celery Airflow worker

My big data infrastructure has Airflow and EMR running as two separate clusters. Currently the data ETL steps are as follows (sketched as commands after the list):

  1. Sqoop data onto an Airflow worker (Hadoop 2.7 is installed there in pseudo-distributed mode)
  2. Sync the data to S3
  3. Access the data on S3 using Spark on EMR (EMR is running Hadoop 3.2.1)
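
For concreteness, the current flow looks roughly like this; the connection string, table, paths and bucket name are placeholders, and the sync step could equally be done with hadoop distcp:

    # Step 1: Sqoop the table into the worker's pseudo-distributed HDFS
    sqoop import \
      --connect jdbc:mysql://db-host:3306/sales \
      --username etl --password-file /user/airflow/.sqoop_pw \
      --table orders \
      --target-dir /staging/orders

    # Step 2: Pull the files down and sync them to S3
    hdfs dfs -get /staging/orders /tmp/orders
    aws s3 sync /tmp/orders s3://my-data-lake/raw/orders/

    # Step 3: Read the data from S3 with Spark on EMR, e.g.
    #   spark.read.csv("s3://my-data-lake/raw/orders/")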

In an attempt to streamline the ETL process, I feel that the second step is completely unnecessary, and that it should be possible to load data through Sqoop directly into S3 (the Sqoop command will be executed on the Airflow worker), with a single command along the lines of the one below.
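
Something like this (again, connection details and bucket name are placeholders):

    sqoop import \
      --connect jdbc:mysql://db-host:3306/sales \
      --username etl --password-file /user/airflow/.sqoop_pw \
      --table orders \
      --target-dir s3://my-data-lake/raw/orders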

But when I set the Sqoop --target-dir parameter to an S3 URL as above, the Sqoop job crashes with java.lang.RuntimeException: java.io.IOException: No FileSystem for scheme: s3. I have attempted many fixes to overcome this issue, but none have been successful so far. The things I have tried (sketched after the list) are:

  1. Pointing Sqoop at the Hadoop installation on EMR instead of the local pseudo-distributed Hadoop
  2. Copying possible dependency JARs from EMR into Sqoop's lib directory, such as emrfs-hadoop-assembly, hadoop-common and hadoop-hdfs
  3. Using different S3 URL schemes such as s3, s3a and s3n
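
Attempts 2 and 3 looked roughly like this (the EMR-side paths are approximate and vary by EMR release):

    # Attempt 2: copy candidate jars from the EMR master into the local Sqoop lib directory
    scp hadoop@emr-master:/usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-*.jar "$SQOOP_HOME/lib/"
    scp hadoop@emr-master:/usr/lib/hadoop/hadoop-common-*.jar "$SQOOP_HOME/lib/"
    scp hadoop@emr-master:/usr/lib/hadoop-hdfs/hadoop-hdfs-*.jar "$SQOOP_HOME/lib/"

    # Attempt 3: the same import with different URL schemes
    #   --target-dir s3://my-data-lake/raw/orders
    #   --target-dir s3a://my-data-lake/raw/orders
    #   --target-dir s3n://my-data-lake/raw/orders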

I'm confident that, to the best of my knowledge, I have configured everything properly. Is there something I have missed? Or is this a Sqoop limitation that doesn't allow loading directly to S3?

You can resolve it by following the steps here: https://aws.amazon.com/premiumsupport/knowledge-center/unknown-dataset-uri-pattern-sqoop-emr/
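
In case the link goes stale: the No FileSystem for scheme: s3 error means no FileSystem implementation is registered for that scheme on Sqoop's classpath. For a standalone Hadoop 2.7 + Sqoop install, the usual way to get an s3a:// target working is to put hadoop-aws and the AWS SDK jars it needs (shipped under share/hadoop/tools/lib) on the classpath and supply S3A credentials, roughly as below; the exact jar locations, the credential mechanism (static keys vs. an instance profile) and the bucket name are assumptions about your setup:

    # Make the S3A filesystem implementation and the AWS SDK visible to Sqoop
    export HADOOP_CLASSPATH="$HADOOP_HOME/share/hadoop/tools/lib/*:$HADOOP_CLASSPATH"
    # (copying those jars into $SQOOP_HOME/lib is another common approach)

    sqoop import \
      -D fs.s3a.access.key=YOUR_ACCESS_KEY \
      -D fs.s3a.secret.key=YOUR_SECRET_KEY \
      --connect jdbc:mysql://db-host:3306/sales \
      --username etl --password-file /user/airflow/.sqoop_pw \
      --table orders \
      --target-dir s3a://my-data-lake/raw/orders

Note that the generic -D options must come immediately after "import", and that s3a is the scheme to use with open-source Hadoop, while the bare s3 scheme is what EMR's EMRFS uses.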
