
elasticsearch-spark failing to index type after parent is set

I start off by manually defining the mapping via the REST API with the following command:

curl -XPUT localhost:9200/myIndex/ -d '{
  "mappings" : { 
          "company": {}, 
          "people": {
               "_parent" : {
                   "type" : "company"
                }
           }
       }
}'

Yet at the Spark layer, indexing fails with the following code.

Here is the people indexing job:

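import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._ // brings saveToEs into scope for Datasets

object PeopleDataCleaner {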
  def main(args: Array[String]): Unit = {
    val liftedArgs = args.lift
    val mongoURL = liftedArgs(0).getOrElse("mongodb://127.0.0.1/mg_test.lc_data_test")
    val elasticsearchHost = liftedArgs(1).getOrElse("52.35.155.55")
    val elasticsearchPort = liftedArgs(2).getOrElse("9200")
    val mongoReadPreferences = liftedArgs(3).getOrElse("primary")
    val spark = SparkSession.builder()
      .appName("Mongo Data CLeaner")
      .master("local[*]")
      .config("spark.mongodb.input.uri", mongoURL)
      .config("mongo.input.query", "{currentCompanies : {$exists: true, $ne: []}}")
      .config("mongo.readPreference.name", mongoReadPreferences)
      .config("es.nodes", elasticsearchHost)
      .config("es.port", elasticsearchPort)
      .getOrCreate()
    import spark.implicits._
    val data = MongoSpark.load[LCDataRecord](spark)
      .as[LCDataRecord]
      .filter { record =>
        record.currentCompanies != null &&
        record.currentCompanies.nonEmpty &&
        record.linkedinId != null
      }
      .map { record =>
        val moddedCurrentCompanies = record.currentCompanies
          .filter { currentCompany => currentCompany.link != null && currentCompany.link != "" }
        record.copy(currentCompanies = moddedCurrentCompanies)
      }
      .flatMap { record =>
          record.currentCompanies.map { currentCompany =>
            currentCompanyToFlatPerson(record, currentCompany)
          }
      }
      .saveToEs("myIndex/people", Map(
        "es.mapping.id" -> "idField",
        "es.mapping.parent" -> "companyLink"
      ))
  }
}

Here is the company indexing job:

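import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._ // brings saveToEs into scope for Datasets

object CompanyDataCleaner {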
  def main(args: Array[String]): Unit = {
    val liftedArgs = args.lift
    val mongoURL = liftedArgs(0).getOrElse("mongodb://127.0.0.1/mg_test.lc_data_test")
    val elasticsearchHost = liftedArgs(1).getOrElse("localhost")
    val elasticsearchPort = liftedArgs(2).getOrElse("9200")
    val mongoReadPreferences = liftedArgs(3).getOrElse("primary")
    val spark = SparkSession.builder()
      .appName("Mongo Data CLeaner")
      .master("local[*]")
      .config("spark.mongodb.input.uri", mongoURL)
      .config("mongo.input.query", "{currentCompanies : {$exists: true, $ne: []}}")
      .config("mongo.readPreference.name", mongoReadPreferences)
      .config("es.index.auto.create", "true")
      .config("es.nodes", elasticsearchHost)
      .config("es.port", elasticsearchPort)
      .getOrCreate()

    import spark.implicits._
    val data = MongoSpark
      .load[LCDataRecord](spark)
      .as[LCDataRecord]
      .filter { record => record.currentCompanies != null && record.currentCompanies.nonEmpty }
      .flatMap(record => record.currentCompanies)
      .filter { record => record.link != null }
      .dropDuplicates("link")
      .map(formatCompanySizes)
      .map(companyToFlatCompany)
      .saveToEs("myIndex/company", Map("es.mapping.id" -> "link"))

  }
}

The job fails with the message org.apache.spark.util.TaskCompletionListenerException: Can't specify parent if no parent field has been configured. This does not seem to be a matter of indexing the companies into Elasticsearch first; my understanding is that the mapping above should already have defined the parent/child relationship.

EDIT: Indexing through the bulk API over REST or through the normal REST indexing API does not run into this issue.

Changing .config("es.index.auto.create", "true") to .config("es.index.auto.create", "false") fixes the problem for me. It appears that even though the index and type already exist, EsSpark still tries to create them, and that is not a legal operation when a parent field is set.
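
To make that concrete, here is a minimal sketch of the write settings with auto-creation turned off. It assumes the connector also honours es.index.auto.create when it is passed in the per-write settings map, as it does other es.* options; EsWriteSettings and childWriteConf are hypothetical names used only for illustration.

```scala
// Hypothetical helper: collects the settings for writing child documents
// into an index whose mapping (including _parent) was created up front.
object EsWriteSettings {
  def childWriteConf(idField: String, parentField: String): Map[String, String] =
    Map(
      // Do not let the connector try to create the index or type itself.
      "es.index.auto.create" -> "false",
      // Field of each record that becomes the document _id.
      "es.mapping.id" -> idField,
      // Field of each record that holds the parent document's id.
      "es.mapping.parent" -> parentField
    )
}
```

In the people job above, this map would replace the inline one, for example .saveToEs("myIndex/people", EsWriteSettings.childWriteConf("idField", "companyLink")). Setting the same key on the SparkSession builder, as described in the answer, should have the same effect for every write in the job.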
