
Create AVRO File AWS Glue Dynamic Frame One to Many Join

Is the following behavior possible in AWS Glue? I am trying to create a single AVRO file by joining two DynamicFrames in a one-to-many fashion.

For example, I have a DyF of teachers with the columns: teacher_id teacher_name

and a DyF of students with the columns: student_id teacher_id student_name

I am trying to combine these so that one teacher can have many students, like this:

 [
   {
     teacher_id: 1,
     teacher_name: 'John',
     students: [
       { student_id: 100, teacher_id: 1, student_name: 'Sally' },
       { student_id: 200, teacher_id: 1, student_name: 'Jack' }
     ]
   },
   ...
 ]

Using Join.apply(teacher, student, 'teacher_id', 'teacher_id') only produces duplicated teacher rows, one per student, like this:

 [
   { teacher_id: 1, teacher_name: 'John', student_id: 100, teacher_id: 1, student_name: 'Sally' },
   { teacher_id: 1, teacher_name: 'John', student_id: 200, teacher_id: 1, student_name: 'Jack' },
   ...
 ]

There may be a better way to do this than what I suggest below, but I hope the following works:

from pyspark.sql.functions import col, struct

# First convert both DynamicFrames to Spark DataFrames so the Spark SQL API is available
students = students.toDF()
teachers = teachers.toDF()

# Then reshape the students DataFrame into a foreign key plus a struct of student fields.
# Keeping teacher_id inside the struct is optional; drop it if you don't need it.
students = students.select(
    col("teacher_id").alias("student_teacher_id"),
    struct("student_id", "teacher_id", "student_name").alias("student_data"),
)

# Then perform the join on the teacher key
result = teachers.join(students, teachers.teacher_id == students.student_teacher_id)

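After this, you should end up with rows that contain all of the teacher data plus a struct column holding the student related to that teacher. Note that the join by itself still yields one row per teacher-student pair; to get a single row per teacher, the student structs have to be gathered into an array, for example by grouping on the teacher columns and aggregating with collect_list. If you then serialize the output to a hierarchical format (e.g. JSON), each student should appear as a child of the teacher.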

