
Using Spark DataFrame to transpose (unpivot) columns

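I am trying to figure out how to solve this use case with a Spark DataFrame.

In the Google Sheet linked below, I have the source data, which stores the survey questions people answered. There will be more than about 1000 question columns, and they are dynamic rather than fixed.

There is also a metadata table that describes each question: its description and the choices it can contain.

The output table should be the one I show in the sheet. Any suggestions or ideas on how to achieve this?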

https://docs.google.com/spreadsheets/d/1BAY8XWaio1DbzcQeQgru6PuNfT9A7Uhf650x_-PAjqo/edit#gid=0

Assume your main table is called df:

+---------+-----------+-----------+------+------+------+
|survey_id|response_id|person_name|Q1D102|Q1D103|Q1D105|
+---------+-----------+-----------+------+------+------+
|xyz      |xyz        |john       |1     |2     |1     |
|abc      |abc        |foo        |3     |1     |1     |
|def      |def        |bar        |2     |2     |2     |
+---------+-----------+-----------+------+------+------+

The mapping table is called df2:

+-----------+-------------+-------------------+---------+-----------+
|question_id|question_name|question_text      |choice_id|choice_desc|
+-----------+-------------+-------------------+---------+-----------+
|Q1D102     |Gender       |What is your gender|1        |Male       |
|Q1D102     |Gender       |What is your gender|2        |Female     |
|Q1D102     |Gender       |What is your gender|3        |Diverse    |
|Q1D103     |Age          |What is your age   |1        |20 - 50    |
|Q1D103     |Age          |What is your age   |2        |50 >       |
|Q1D105     |work_status  |Do you work        |1        |Yes        |
|Q1D105     |work_status  |Do you work        |2        |No         |
+-----------+-------------+-------------------+---------+-----------+

We can build a dynamic unpivot expression as follows:

val columns = df.columns.filter(c => c.startsWith("Q1D"))

val data = columns.map(c => s"'$c', $c").mkString(",")

val finalExpr = s"stack(${columns.length}, $data) as (question_id, choice_id)"

With 3 questions (Q1D102, Q1D103, Q1D105), we get the following expression: stack(3, 'Q1D102', Q1D102,'Q1D103', Q1D103,'Q1D105', Q1D105) as (question_id, choice_id)
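As a small sketch of the same string-building step, the helper below wraps it in a function that needs no Spark on the classpath, so the generated expression can be inspected on its own. The names StackExpr and build are hypothetical (not from the answer above), and the backtick quoting of column references is an added assumption meant to guard against column names containing dots or spaces.

```scala
object StackExpr {
  // Hypothetical helper: builds a Spark SQL stack(...) expression that
  // unpivots every column whose name starts with `prefix`.
  def build(allColumns: Seq[String], prefix: String): String = {
    val questionCols = allColumns.filter(_.startsWith(prefix))
    require(questionCols.nonEmpty, s"no columns start with '$prefix'")
    // Each pair is: the column name as a string literal, then the
    // backtick-quoted column reference that supplies the value.
    val pairs = questionCols.map(c => s"'$c', `$c`").mkString(", ")
    s"stack(${questionCols.length}, $pairs) as (question_id, choice_id)"
  }

  def main(args: Array[String]): Unit = {
    val cols = Seq("survey_id", "response_id", "person_name",
                   "Q1D102", "Q1D103", "Q1D105")
    // Print the expression so it can be checked before handing it to selectExpr.
    println(build(cols, "Q1D"))
  }
}
```

The returned string can be passed to selectExpr in the same way as finalExpr.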

Finally, we use the constructed expression:

// Assign to a new val: reassigning df only compiles if df was declared as a var.
val result = df
  .selectExpr("survey_id", "response_id", "person_name", finalExpr)
  .join(df2, Seq("question_id", "choice_id"), "left")

You get this result:

+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+
|question_id|choice_id|survey_id|response_id|person_name|question_name|question_text      |choice_desc|
+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+
|Q1D102     |1        |xyz      |xyz        |john       |Gender       |What is your gender|Male       |
|Q1D102     |2        |def      |def        |bar        |Gender       |What is your gender|Female     |
|Q1D102     |3        |abc      |abc        |foo        |Gender       |What is your gender|Diverse    |
|Q1D103     |1        |abc      |abc        |foo        |Age          |What is your age   |20 - 50    |
|Q1D103     |2        |xyz      |xyz        |john       |Age          |What is your age   |50 >       |
|Q1D103     |2        |def      |def        |bar        |Age          |What is your age   |50 >       |
|Q1D105     |1        |xyz      |xyz        |john       |work_status  |Do you work        |Yes        |
|Q1D105     |1        |abc      |abc        |foo        |work_status  |Do you work        |Yes        |
|Q1D105     |2        |def      |def        |bar        |work_status  |Do you work        |No         |
+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+

I think this is what you need (just unordered), good luck!
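If a stable row order matters, a sort can be appended after the join. The following is a minimal end-to-end sketch, assuming a local SparkSession and the sample rows from the tables above. The object name UnpivotSurvey, the function unpivotAndJoin, and the choice of sort keys are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnpivotSurvey {
  // Hypothetical helper: unpivots all columns starting with `prefix`,
  // left-joins the metadata table, and sorts the result.
  def unpivotAndJoin(df: DataFrame, df2: DataFrame, prefix: String): DataFrame = {
    val (questionCols, idCols) = df.columns.partition(_.startsWith(prefix))
    val pairs = questionCols.map(c => s"'$c', `$c`").mkString(", ")
    val stackExpr = s"stack(${questionCols.length}, $pairs) as (question_id, choice_id)"
    df.selectExpr((idCols :+ stackExpr): _*)
      .join(df2, Seq("question_id", "choice_id"), "left")
      // Assumption: survey_id exists and is a sensible primary sort key.
      .orderBy("survey_id", "question_id")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, purely for illustration.
    val spark = SparkSession.builder()
      .master("local[*]").appName("unpivot-survey").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("xyz", "xyz", "john", 1, 2, 1),
      ("abc", "abc", "foo", 3, 1, 1),
      ("def", "def", "bar", 2, 2, 2)
    ).toDF("survey_id", "response_id", "person_name", "Q1D102", "Q1D103", "Q1D105")

    val df2 = Seq(
      ("Q1D102", "Gender", "What is your gender", 1, "Male"),
      ("Q1D102", "Gender", "What is your gender", 2, "Female"),
      ("Q1D102", "Gender", "What is your gender", 3, "Diverse"),
      ("Q1D103", "Age", "What is your age", 1, "20 - 50"),
      ("Q1D103", "Age", "What is your age", 2, "50 >"),
      ("Q1D105", "work_status", "Do you work", 1, "Yes"),
      ("Q1D105", "work_status", "Do you work", 2, "No")
    ).toDF("question_id", "question_name", "question_text", "choice_id", "choice_desc")

    unpivotAndJoin(df, df2, "Q1D").show(false)
    spark.stop()
  }
}
```

Because the join is a left join, a response whose choice_id has no match in df2 is kept, with null in the metadata columns.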

