
Spark-SQL using withColumnRenamed()

I am trying to load a Parquet file with columns storyId1 and publisher1. I want to find all pairs of publishers that publish articles about the same stories. For each publisher pair I need to report the number of co-published stories, where a co-published story is a story published by both publishers. The pairs should be reported in decreasing order of frequency. The solution must conform to the following rules:

1. There should not be any replicated entries like: NASDAQ, NASDAQ, 1000
2. The same pair should not occur twice in opposite order. Only one of the following should occur:
NASDAQ, Reuters, 1000
Reuters, NASDAQ, 1000
(i.e. it is incorrect to have both of the above two lines in the result)

I have tried the following code:

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import spark.implicits._


val worddocDF = spark.read.parquet("file:///home/user204943816622/t4_story_publishers.parquet")
val worddocDF1 = spark.read.parquet("file:///home/user204943816622/t4_story_publishers.parquet")
worddocDF.cache()
val joinDF = worddocDF.join(worddocDF1, "storyId1").withColumnRenamed("worddocDF.publisher1", "publisher2")
joinDF.filter($"publisher1" !== $"publisher2")

Input format:

[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Livemint]

[ddUyU0VZz0BRneMioxUPQVP6sIxvM, IFA Magazine]

[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Moneynews]

[ddUyU0VZz0BRneMioxUPQVP6sIxvM, NASDAQ]

[dPhGU51DcrolUIMxbRm0InaHGA2XM, IFA Magazine]

[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Los Angeles Times]

[dPhGU51DcrolUIMxbRm0InaHGA2XM, NASDAQ]

Required output:

[NASDAQ,IFA Magazine,2]

[Moneynews,Livemint,1]

[Moneynews,IFA Magazine,1]

[NASDAQ,Livemint,1]

[NASDAQ,Los Angeles Times,1]

[Moneynews,Los Angeles Times,1]

[Los Angeles Times,IFA Magazine,1]

[Livemint,IFA Magazine,1]

[NASDAQ,Moneynews,1]

[Los Angeles Times,Livemint,1]
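
import spark.implicits._

// Self-join on storyId1; aliasing each side ("a" and "b") keeps both
// publisher1 columns addressable after the join.
val pairsDF = worddocDF.as("a")
  .join(
    worddocDF.as("b"),
    $"a.storyId1" === $"b.storyId1" && $"a.publisher1" =!= $"b.publisher1",
    "inner"
  )
  .select(
    $"a.storyId1".as("storyId"),
    $"a.publisher1".as("publisher1"),
    $"b.publisher1".as("publisher2")
  )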
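
The join above removes self-pairs, but it still returns every pair in both orders and does not count anything yet. One possible way to finish the task is sketched below. It is a sketch under assumptions, not the original answer: the names `stories`, `pairCounts`, and `coPublished` are made up for illustration, the sample rows are copied from the input shown in the question instead of being read from the Parquet file, and the local `SparkSession` is only there so the snippet runs on its own (in spark-shell, `spark` already exists). Comparing the two publisher names with `>` instead of `=!=` keeps exactly one ordering of each pair, and `countDistinct` guards against a story being listed twice for the same publisher.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{countDistinct, desc}

// Local session for illustration only; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("co-published-stories")
  .getOrCreate()
import spark.implicits._

// Sample rows copied from the question's input (stand-in for the Parquet file).
val stories = Seq(
  ("ddUyU0VZz0BRneMioxUPQVP6sIxvM", "Livemint"),
  ("ddUyU0VZz0BRneMioxUPQVP6sIxvM", "IFA Magazine"),
  ("ddUyU0VZz0BRneMioxUPQVP6sIxvM", "Moneynews"),
  ("ddUyU0VZz0BRneMioxUPQVP6sIxvM", "NASDAQ"),
  ("dPhGU51DcrolUIMxbRm0InaHGA2XM", "IFA Magazine"),
  ("ddUyU0VZz0BRneMioxUPQVP6sIxvM", "Los Angeles Times"),
  ("dPhGU51DcrolUIMxbRm0InaHGA2XM", "NASDAQ")
).toDF("storyId1", "publisher1")

val pairCounts = stories.as("a")
  .join(
    stories.as("b"),
    // Same story, and a strict ordering on the names: this drops
    // (X, X) pairs and keeps only one of (X, Y) / (Y, X).
    $"a.storyId1" === $"b.storyId1" && $"a.publisher1" > $"b.publisher1",
    "inner"
  )
  .select(
    $"a.storyId1".as("storyId"),
    $"a.publisher1".as("publisher1"),
    $"b.publisher1".as("publisher2")
  )
  .groupBy("publisher1", "publisher2")
  // Count each shared story once per pair.
  .agg(countDistinct("storyId").as("coPublished"))
  .orderBy(desc("coPublished"))

pairCounts.show(false)
```

Using `<` instead of `>` would satisfy the rules equally well; it only flips which publisher appears first in each pair.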
