
Read FASTQ file into an AWS Glue Job Script

I need to read a FASTQ file into an AWS Glue job script, but I'm getting this error:

Traceback (most recent call last):
  File "/opt/amazon/bin/runscript.py", line 59, in <module>
    runpy.run_path(script, run_name='__main__')
  File "/usr/lib64/python3.7/runpy.py", line 261, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "/usr/lib64/python3.7/runpy.py", line 236, in _get_code_from_file
    code = compile(f.read(), fname, 'exec')
  File "/tmp/test20200930", line 24
    datasource0 = spark.createDataset(sc.textFile("s3://sample-genes-data/fastq/S_Sonnei_short_reads_1.fastq").sliding(4, 4).map {
                                                                                                                                 ^
SyntaxError: invalid syntax

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/amazon/bin/runscript.py", line 92, in <module>
    while "runpy.py" in new_stack.tb_frame.f_code.co_filename:
AttributeError: 'NoneType' object has no attribute 'tb_frame'

This is my code:

import org.apache.spark.mllib.rdd.RDDFunctions._

datasource0 = spark.createDataset(sc.textFile("s3://sample-genes-data/fastq/S_Sonnei_short_reads_1.fastq").sliding(4, 4).map {
  case Array(id, seq, _, qual) => (id, seq, qual)
 }).toDF("identifier", "sequence", "quality")
datasource1 = DynamicFrame.fromDF(datasource0, glueContext, "nullv")

I followed this link: Read FASTQ file into a Spark dataframe

I was able to run the code by wrapping it inside a GlueApp object. You can use the code below after replacing the S3 path with your own.

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import org.apache.spark.SparkContext
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.SparkSession
import com.amazonaws.services.glue.DynamicFrame
import org.apache.spark.mllib.rdd.RDDFunctions._

object GlueApp {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val sparkSession: SparkSession = glueContext.getSparkSession
    import sparkSession.implicits._

    // Each FASTQ record spans four lines (identifier, sequence, "+" separator,
    // quality); sliding(4, 4) groups them into one Array per record and the
    // "+" line is dropped via the wildcard pattern.
    val datasource0 = sparkSession.createDataset(spark.textFile("s3://<s3path>").sliding(4, 4).map {
      case Array(id, seq, _, qual) => (id, seq, qual)
    }).toDF("identifier", "sequence", "quality")

    // Convert the DataFrame into a Glue DynamicFrame.
    val datasource1 = DynamicFrame(datasource0, glueContext)
    datasource1.show()
    datasource1.printSchema()
    Job.commit()
  }
}
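A note on the job setup (my reading of the traceback, not part of the original answer): the SyntaxError in the question is the Python interpreter trying to compile this Scala snippet, so the Glue job itself must be created as a Scala Spark (ETL) job rather than a Python one; in Glue this is controlled by the --job-language (scala) and --class (GlueApp) job parameters.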

Input passed:

@seq1
AGTCAGTCGAC
+
?@@FFBFFDDH
@seq2
CCAGCGTCTCG
+
?88ADA?BDF8

Output:

{"identifier": "@seq1", "sequence": "AGTCAGTCGAC", "quality": "?@@FFBFFDDH"}
{"identifier": "@seq2", "sequence": "CCAGCGTCTCG", "quality": "?88ADA?BDF8"}
