
Read Headers from Data Source in an AWS Glue Job

I have an AWS Glue job that reads from a data source like so:

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "dev-data", table_name = "contacts", transformation_ctx = "datasource0")

But when I call .toDF() on the dynamic frame, the column names come out as 'col0', 'col1', 'col2', etc., and my actual headers end up in the first row of the DataFrame.

Note: I can't set them manually because the columns in the data source vary, and iterating over the columns in a loop to rename them results in an error, because you'd have to reassign the same DataFrame variable multiple times, which Glue can't handle.

How might I capture the headers while reading from the data source?

It turns out it's a bug in the Glue crawler; it doesn't support headers yet. The workaround I used was to go through the motions of crawling the data anyway. When the crawler completes, a Lambda that triggers off the crawler-completion CloudWatch event kicks off a Glue job that reads directly from S3. When Glue is fixed to support headers, I can switch how I read them in.
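The Lambda step of this workaround might look roughly like the sketch below. It is not the answerer's actual code: the job and crawler names are hypothetical, and it assumes the CloudWatch event carries the crawler name and state under `detail.crawlerName` and `detail.state`, which is how Glue crawler state-change events are usually shaped. The optional `glue_client` argument is only there so the handler can be exercised without AWS credentials.

```python
# Hypothetical names, not taken from the original post.
GLUE_JOB_NAME = "contacts-etl"
CRAWLER_NAME = "contacts-crawler"


def lambda_handler(event, context, glue_client=None):
    """Start the Glue job once the crawler reports success."""
    detail = event.get("detail", {})

    # Ignore events from other crawlers and non-successful runs.
    if detail.get("crawlerName") != CRAWLER_NAME or detail.get("state") != "Succeeded":
        return {"started": False}

    if glue_client is None:
        # Imported lazily; boto3 is provided by the Lambda Python runtime.
        import boto3
        glue_client = boto3.client("glue")

    response = glue_client.start_job_run(JobName=GLUE_JOB_NAME)
    return {"started": True, "jobRunId": response["JobRunId"]}
```

Inside the job itself, a plain Spark read such as `spark.read.option("header", "true").csv(path)` is one way to pick up the header row while bypassing the catalog table.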

You can try the withHeader param, e.g.:

dyF = glueContext.create_dynamic_frame.from_options(
    's3',
    {'paths': ['s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv']},
    'csv',
    {'withHeader': True})

The withHeader option is documented among the CSV format options in the AWS Glue documentation.
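With the header read this way, the DataFrame takes whatever column names the file supplies. If those names need tidying, all columns can be renamed in one `toDF(*names)` call instead of a loop that reassigns the DataFrame once per column, which is the pattern the question had trouble with. The helper below is a minimal sketch of that idea; `normalize_headers` and its cleaning rules are illustrative assumptions, not part of Glue or Spark.

```python
import re


def normalize_headers(names):
    """Return cleaned, unique column names in the same order as `names`."""
    seen = {}
    cleaned = []
    for position, raw in enumerate(names):
        # Replace runs of non-alphanumeric characters with underscores.
        name = re.sub(r"[^0-9A-Za-z]+", "_", str(raw).strip()).strip("_").lower()
        if not name:
            # Fall back to a positional name for blank headers.
            name = f"col{position}"
        # Append a counter to repeated names so every column stays unique.
        seen[name] = seen.get(name, 0) + 1
        if seen[name] > 1:
            name = f"{name}_{seen[name]}"
        cleaned.append(name)
    return cleaned


# Usage inside a Glue job (not executed here):
#   df = dyF.toDF()
#   df = df.toDF(*normalize_headers(df.columns))  # single rename, no loop
```

The duplicate handling is deliberately simple: it does not check whether a generated name such as `e_mail_2` already exists as a real header in the file.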

I know this post is old, but I just ran into a similar issue and spent way too long figuring out what the problem was. Wanted to share my solution in case it's helpful to others!

I was using the GUI on AWS and forgot to add the correct classifier to the crawler before running it. As a result, AWS Glue detected the data types incorrectly (they mostly came out as strings) and did not detect the column names (they came out as col1, col2, etc.). You can create the classifier in "Classifiers" under "Crawlers". Then, when setting up the crawler, add your classifier to the "Selected classifiers" section at the bottom.

Documentation: https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html
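The answer above uses the console, but the same setup can probably be scripted. The sketch below assumes boto3's Glue client and its `create_classifier` and `update_crawler` calls; the function names and the classifier and crawler names in the usage comment are hypothetical. The optional `glue_client` parameter exists only so the functions can be tried without AWS access.

```python
def csv_classifier_request(name, delimiter=",", quote_symbol='"'):
    """Build the create_classifier arguments for a CSV file with a header row."""
    return {
        "CsvClassifier": {
            "Name": name,
            "Delimiter": delimiter,
            "QuoteSymbol": quote_symbol,
            # Tell the crawler that the first row holds the column names.
            "ContainsHeader": "PRESENT",
        }
    }


def _client(glue_client):
    if glue_client is None:
        # Imported lazily so the request builder works without boto3 installed.
        import boto3
        glue_client = boto3.client("glue")
    return glue_client


def create_csv_classifier(name, glue_client=None):
    """Create the classifier in the Glue account."""
    _client(glue_client).create_classifier(**csv_classifier_request(name))
    return name


def attach_classifier(crawler_name, classifier_name, glue_client=None):
    """Set the crawler's custom classifier list to this one classifier."""
    _client(glue_client).update_crawler(
        Name=crawler_name, Classifiers=[classifier_name]
    )


# Usage (hypothetical names, not executed here):
#   create_csv_classifier("csv-with-header")
#   attach_classifier("contacts-crawler", "csv-with-header")
```

Note that `attach_classifier` as written replaces any classifiers already configured on the crawler rather than appending to them.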


 