[英]Using spark dataframe transpose
我试图弄清楚如何使用 spark dataframe 来解决这个用例。
在下面的谷歌表中,我有源数据,其中将存储人们回答的调查问题。 此外,问题列将大约超过 1000 列,并且更加动态且不固定。
有一个元数据表,它解释了问题、它的描述和它可以包含的选项。
Output 表应该是我在工作表中提到的那个。 关于如何实现这一点的任何建议或想法?
https://docs.google.com/spreadsheets/d/1BAY8XWaio1DbzcQeQgru6PuNfT9A7Uhf650x_-PAjqo/edit#gid=0
假设您的主表名为df
:
+---------+-----------+-----------+------+------+------+
|survey_id|response_id|person_name|Q1D102|Q1D103|Q1D105|
+---------+-----------+-----------+------+------+------+
|xyz |xyz |john |1 |2 |1 |
|abc |abc |foo |3 |1 |1 |
|def |def |bar |2 |2 |2 |
+---------+-----------+-----------+------+------+------+
映射表称为df2
:
+-----------+-------------+-------------------+---------+-----------+
|question_id|question_name|question_text |choice_id|choice_desc|
+-----------+-------------+-------------------+---------+-----------+
|Q1D102 |Gender |What is your gender|1 |Male |
|Q1D102 |Gender |What is your gender|2 |Female |
|Q1D102 |Gender |What is your gender|3 |Diverse |
|Q1D103 |Age |What is your age |1 |20 - 50 |
|Q1D103 |Age |What is your age |2 |50 > |
|Q1D105 |work_status |Do you work |1 |Yes |
|Q1D105 |work_status |Do you work |2 |No |
+-----------+-------------+-------------------+---------+-----------+
我们可以构造一个动态的 unpivot 表达式,如下所示:
val columns = df.columns.filter(c => c.startsWith("Q1D"))
val data = columns.map(c => s"'$c', $c").mkString(",")
val finalExpr = s"stack(${columns.length}, $data) as (question_id, choice_id)"
通过 3 个问题,我们得到以下表达式( Q1D102
、 Q1D103
和Q1D105
): stack(3, 'Q1D102', Q1D102,'Q1D103', Q1D103,'Q1D105', Q1D105) as (question_id, choice_id)
最后,我们使用构造变量:
df = df
.selectExpr("survey_id", "response_id", "person_name", finalExpr)
.join(df2, Seq("question_id", "choice_id"), "left")
你得到这个结果:
+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+
|question_id|choice_id|survey_id|response_id|person_name|question_name|question_text |choice_desc|
+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+
|Q1D102 |1 |xyz |xyz |john |Gender |What is your gender|Male |
|Q1D102 |2 |def |def |bar |Gender |What is your gender|Female |
|Q1D102 |3 |abc |abc |foo |Gender |What is your gender|Diverse |
|Q1D103 |1 |abc |abc |foo |Age |What is your age |20 - 50 |
|Q1D103 |2 |xyz |xyz |john |Age |What is your age |50 > |
|Q1D103 |2 |def |def |bar |Age |What is your age |50 > |
|Q1D105 |1 |xyz |xyz |john |work_status |Do you work |Yes |
|Q1D105 |1 |abc |abc |foo |work_status |Do you work |Yes |
|Q1D105 |2 |def |def |bar |work_status |Do you work |No |
+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+
我认为这是你需要的(只是无序的),祝你好运!
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.