
Spark Creating DataFrame from a text File

I am trying to create a DataFrame from a text file in Spark, but it throws an error. Here is my code:

case class BusinessSchema(business_id: String, name: String, address: String, city: String, postal_code: String, latitude: String, longitude: String, phone_number: String, tax_code: String,
business_certificate: String, application_date: String, owner_name: String, owner_address: String, owner_city: String, owner_state: String, owner_zip: String)

val businessDataFrame = sc.textFile(s"$baseDir/businesses_plus.txt").map(x=>x.split("\t")).map{
  case Array(business_id, name, address, city, postal_code, latitude, longitude, phone_number, tax_code,business_certificate, application_date, owner_name, owner_address, owner_city, owner_state, owner_zip) => BusinessSchema(business_id, name, address, city, postal_code, latitude, longitude, phone_number, tax_code,business_certificate, application_date, owner_name, owner_address, owner_city, owner_state, owner_zip)} 

val businessRecords = businessDataFrame.toDF()

The error occurs when I run this code:

businessRecords.take(20)

The error that is thrown:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23.0 failed 1 times, most recent failure: Lost task 0.0 in stage 23.0 (TID 25, localhost): scala.MatchError: [Ljava.lang.String;@6da1c3f1 (of class [Ljava.lang.String;)

A MatchError indicates that a pattern match failed: some input matched none of the cases. Here, there is a single case, which matches the result of split("\t") only when it is an Array with exactly 16 elements.

Your data probably contains some records that don't follow this assumption (lines with fewer or more than 16 tab-separated fields), which would cause this exception.
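As a minimal sketch of the failure mode (toy values, not your actual data), matching an array of the wrong length against a fixed-length Array pattern throws exactly this exception:

// Toy reproduction (hypothetical values): the single case expects exactly
// three elements, so a two-element array falls through and throws
// scala.MatchError, just like a short line in businesses_plus.txt would.
def parse(fields: Array[String]): String = fields match {
  case Array(a, b, c) => s"$a / $b / $c"
}

parse(Array("id", "name", "city"))  // "id / name / city"
parse(Array("id", "name"))          // throws scala.MatchError: [Ljava.lang.String;@...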

To fix this, you can either replace the use of map with collect(f: PartialFunction[T, U]), which takes a PartialFunction (one that may silently ignore inputs that match none of the cases) and thereby simply filters out all the bad records:

sc.textFile(s"$baseDir/businesses_plus.txt").map(x=>x.split("\t")).collect {
  case Array(business_id, name, address, city, postal_code, latitude, longitude, phone_number, tax_code,business_certificate, application_date, owner_name, owner_address, owner_city, owner_state, owner_zip) => BusinessSchema(business_id, name, address, city, postal_code, latitude, longitude, phone_number, tax_code,business_certificate, application_date, owner_name, owner_address, owner_city, owner_state, owner_zip)
} 
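Note that this is the collect(f: PartialFunction[T, U]) transformation on RDD, not the parameterless collect() action that pulls all data to the driver; the result is still an RDD of BusinessSchema, so the conversion to a DataFrame works as before (a sketch, assuming the toDF() implicits are in scope as in the question, and naming the collected RDD parsedBusinesses for illustration):

// Sketch: the RDD produced by collect above (named parsedBusinesses here)
// converts to a DataFrame exactly as in the original code; malformed
// lines have already been dropped, so take(20) no longer fails.
val businessRecords = parsedBusinesses.toDF()
businessRecords.take(20)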

Or, add a case to catch the bad records and do something with them. For example, you can replace the RDD[BusinessSchema] result with an RDD[Either[BusinessSchema, Array[String]]] to reflect the fact that some records failed to parse, while keeping the bad data available for logging or other uses:

val withErrors: RDD[Either[BusinessSchema, Array[String]]] = sc.textFile(s"$baseDir/businesses_plus.txt")
  .map(x=>x.split("\t"))
  .map {
    case Array(business_id, name, address, city, postal_code, latitude, longitude, phone_number, tax_code,business_certificate, application_date, owner_name, owner_address, owner_city, owner_state, owner_zip) => Left(BusinessSchema(business_id, name, address, city, postal_code, latitude, longitude, phone_number, tax_code,business_certificate, application_date, owner_name, owner_address, owner_city, owner_state, owner_zip))
    case badArray => Right(badArray)
  } 

// filter bad records, you can log / count / ignore them
val badRecords: RDD[Array[String]] = withErrors.collect { case Right(a) => a } 

// filter good records - you can go on as planned from here...
val goodRecords: RDD[BusinessSchema] = withErrors.collect { case Left(r) => r }
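To close the loop (a sketch with hypothetical log output), you might count the bad records for visibility and then build the DataFrame from the good ones:

// Sketch: surface the number of malformed lines, then proceed as planned.
// Assumes the toDF() implicits are in scope, as in the original code.
val badCount = badRecords.count()
if (badCount > 0) println(s"Skipped $badCount malformed record(s)")

val businessDF = goodRecords.toDF()
businessDF.take(20)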
