
Spark Scala: build a new column using a function using another dataframe

Here's my issue: I have a first dataframe which is basically a list of cities and the country they reside in. I have a second dataframe with a list of users and the cities they reside in. I'd like to add a "country" column to the second dataframe, whose value would of course be based on the "city" column, but the city names can be typed differently (for example, Washington and washington would both have to give me USA).

I thought the best way to do that would be to create a foo(city: String): String function which would return the country by looking it up in the first dataframe, but I can't find a way to use this function while creating my new column.

First lowercase the city column of both dataframes, since you are going to join on the city key; afterwards, capitalize the first letter again with initcap. This code should do what you are looking for:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, initcap, lower}

object Main {

  def main(args: Array[String]): Unit = {
    val sparkSession: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples.com")
      .getOrCreate()

    import sparkSession.implicits._

    // Normalize the join key to lower case in both dataframes
    val citiesDF = Seq(
      ("London", "England"), ("Washington", "USA")
    )
      .toDF("city", "country")
      .withColumn("city", lower(col("city")))

    val usersDF = Seq(
      ("Andy", "London"), ("Mark", "Washington"), ("Bob", "washington")
    )
      .toDF("name", "city")
      .withColumn("city", lower(col("city")))

    // Join on the normalized key, then restore the capitalized spelling
    val resultDF = citiesDF.join(usersDF, Seq("city"))
      .withColumn("city", initcap(col("city")))
    resultDF.show()
  }

}
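If you still want a lookup function in the spirit of the foo you described, here is a minimal plain-Scala sketch of a case-insensitive city-to-country lookup (the CountryLookup object, its Map contents, and the foo name are illustrative, not from the answer above). In Spark you would wrap such a function in a UDF, but the join shown above is generally preferable because it stays within Spark's optimizer.

```scala
// Illustrative sketch: a case-insensitive city -> country lookup,
// mirroring the asker's foo idea. Names and data are hypothetical.
object CountryLookup {
  private val cityToCountry: Map[String, String] = Map(
    "london" -> "England",
    "washington" -> "USA"
  )

  // Normalize the city name before the lookup, so "Washington" and
  // "washington" both resolve to the same country.
  def foo(city: String): Option[String] =
    cityToCountry.get(city.trim.toLowerCase)
}

object LookupDemo {
  def main(args: Array[String]): Unit = {
    println(CountryLookup.foo("Washington")) // Some(USA)
    println(CountryLookup.foo("washington")) // Some(USA)
    println(CountryLookup.foo("Atlantis"))   // None
  }
}
```

Returning Option[String] rather than a bare String makes the "city not found" case explicit instead of forcing a null or an exception.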
