Tag [aws-glue-data-catalog] - 堆栈内存溢出

Glue crawler creating multiple tables

I have 2 S3 buckets with the following format: s3://bucket/{lob_name_1}/{table_name}/{current_date}/table_name.csv s3://bucket/{lob_name_2}/{table_name}/{current_date}/table_n ...

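AWS Glue Job: An error occurred while calling getCatalogSource. None.get

I was using a username/password in my AWS Glue connection and have now switched to Secrets Manager. Now I get this error when running my ETL job: An error occurred while calling o89.getCatalogSource. None.get. This happens even though the connection and the crawler both work (I added the connection in the job details). This previously working ...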

How to define the AWS Athena S3 output location using Terraform when using aws_glue_catalog_database and aws_glue_catalog_table resources

Summary: The Terraform configuration below creates the aws_glue_catalog_database and aws_glue_catalog_table resources, but does not define the S3 bucket output location that is required to use these resources in an Athena context. Through the AWS console I can manually add the S3 output location ...

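AWS Glue enableUpdateCatalog not creating new partitions after successful job run

I am having an issue where I set enableUpdateCatalog=True and updateBehaviour=LOG to update my Glue table, which has 1 partition key. After the job finishes, no new partitions are added to my Glue catalog table, but the data in S3 is separated by the partition key I used. How do I get the job to partition the Glue catalog table automatically? Currently I have to manually run bo ...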

Partitioning by date on Glue: 1 date column vs 3 columns (year/month/day)?

I am wondering why, in the Glue/Athena/Redshift Spectrum documentation and workshops, all the date partitioning examples use 3 columns (year/month/day), which means something like s3://../year=2022/month=08/day=30/[filename].parquet on Amazon S3. Is there ...
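
To make the two layouts being compared concrete, here is a minimal sketch that builds the S3 prefix for the same date under each scheme. The base path `s3://example-bucket/events` and the single-column key name `dt` are assumptions chosen for illustration; only the `year=/month=/day=` layout comes from the question.

```python
from datetime import date


def single_column_prefix(base: str, d: date) -> str:
    """One partition key (hypothetical name `dt`), e.g. .../dt=2022-08-30/"""
    return f"{base.rstrip('/')}/dt={d.isoformat()}/"


def three_column_prefix(base: str, d: date) -> str:
    """Three partition keys, matching the year=/month=/day= layout in the question."""
    return (
        f"{base.rstrip('/')}/"
        f"year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/"
    )


if __name__ == "__main__":
    day = date(2022, 8, 30)
    base = "s3://example-bucket/events"  # hypothetical location
    print(single_column_prefix(base, day))
    print(three_column_prefix(base, day))
```

With three keys, a filter on a whole month or year can be written as an equality on one partition column; with a single key, the same filter becomes a range or prefix condition on `dt`.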

Unable to use BLANKSASNULL data conversion parameter in write_dynamic_frame.from_catalog while moving data to Redshift table

Previously, to move data to the Redshift table we used the "COPY" command, which supports data conversion parameters such as BLANKSASNULL and EMPTYASNULL. Since our data contains both "empty string" and "Null" values, we converted both to Null when moving to the Redshift table, as shown below. Sample code: Now, ...

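Glue Catalog w/ Delta Tables Connected to Databricks SQL Engine

I am trying to query Delta tables from the AWS Glue catalog on the Databricks SQL engine. They are stored in Delta Lake format. I have Glue crawlers automating the schema. The catalog is set up and working with non-Delta tables. The setup through Databricks loads the available tables for each database via the catalog, but because hive is used instead of ...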

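42703 ERROR: column "my_nested_column" does not exist

I ran a Glue crawler on a nested JSON data source on S3 and am trying to query the nested fields through Redshift Spectrum as per the documentation. As per the title, however, I get the error message above. From the metadata I can see that the field exists, so this does not really make sense. But because of it I cannot unnest fields from "my_nested_column" ...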

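How can we update existing partition data in an AWS Glue table without running a crawler?

When we update the data in an existing partition by uploading manually to the S3 bucket, the data shows up in the existing partition of the Glue table in Athena. But when the data is updated using the API, the data uploaded to the S3 bucket is in the existing partition, while in the Glue table the data is stored in a different partition for the current date [last modified] (Aug 2, 2022 17:52:15 (UTC+05 ...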

Why is Kinesis or Crawler creating partitions in my data?

Context: I am using Kinesis to put streamed data from a Lambda into an S3 bucket according to a Glue schema. I then run a crawler on my S3 bucket to catalog my data. My data has the following attributes when written to Kinesis Firehose: 'dataset_datetime, attr1, attr2, att ...

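How to prevent AWS Glue crawler from reading wrong data types?

I am running an AWS Glue crawler on a CSV file. This CSV file has a string column that contains alphanumeric values. The crawler sets the data type of this column to INT (instead of string). This causes my ETL to fail. Is there any way to force Glue to correct this? I do not want to put the schema into the crawler manually, as that defeats the purpose of automatic data cataloging. ...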

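Cache error when querying table stored in Glue Data Catalog through Zeppelin

There is something wrong with the way Zeppelin caches tables. We update the data in the Glue Data Catalog in real time, so when we want to query a partition that was updated using Spark, we sometimes get the following error: This can be fixed by issuing the command refresh table <table_name> or, from the Zeppelin UI, restarting the Sp ...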

AWS Glue Crawler glob Exclude Pattern functionality

We need to ignore some paths while crawling a specific path. Details below: Full path: "s3://dev-bronze/api/sp/reports/xyz/brand=abc/client=xxx/" We want to ignore the data of some clients. So I am using the glob above, but it does not seem to work. Any help would be appreciated. ...
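
The glob used in the question is not shown in the excerpt. As a rough aid for reasoning about such patterns, the sketch below approximates the documented glob rules for crawler exclude patterns: patterns are evaluated relative to the include path, `*` and `?` stay within one path component, and `**` crosses folder boundaries. This is a simplified local approximation, not Glue's implementation; it ignores braces and bracket expressions, and the include path split, the pattern `brand=*/client=xxx/**`, and the file name `part-0000.parquet` are hypothetical.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simplified exclude glob into a regex (supports **, * and ? only)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")      # ** may cross folder boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")   # * stays inside one path component
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")    # ? is exactly one non-slash character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_excluded(include_path: str, object_path: str, exclude_patterns: list) -> bool:
    """Check an object path against exclude globs, relative to the include path."""
    if object_path.startswith(include_path):
        relative = object_path[len(include_path):]
    else:
        relative = object_path
    relative = relative.lstrip("/")
    return any(glob_to_regex(p).fullmatch(relative) for p in exclude_patterns)


if __name__ == "__main__":
    include = "s3://dev-bronze/api/sp/reports/xyz/"   # assumed include path
    patterns = ["brand=*/client=xxx/**"]               # hypothetical exclude pattern
    for client in ("xxx", "yyy"):
        path = f"{include}brand=abc/client={client}/part-0000.parquet"
        print(client, is_excluded(include, path, patterns))
```

A pattern written as a full `s3://` URI, or relative to the bucket root rather than to the include path, would not match anything under these rules, which is one thing worth checking when an exclude pattern appears to have no effect.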

AWS Glue Crawlers add partitions within exclude pattern conditions

I have run into the following situation: suppose I have the following S3 structure s3://my_bucket/path_to_crawl/partition=A/some_file.parquet s3://my_bucket/path_to_crawl/partition=B/some_file.parquet s ...

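AWS Glue job (PySpark) to AWS Glue Data Catalog

As we know, the procedure for writing from a PySpark script (AWS Glue job) to the AWS Data Catalog is to write to an S3 bucket (e.g. CSV), use a crawler, and schedule it. Is there any other way of writing to the AWS Glue Data Catalog? I am looking for a direct way to do this, e.g. writing to an S3 file and syncing to the AWS Glue Data Catalog. ...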

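AWS Athena partition projection

I can't seem to get Athena partition projection to work. When I add partitions the "old-fashioned" way and then run MSCK REPAIR TABLE testparts; I can query the data. When I drop the table and recreate it using the partition projection below, it fails to query any data at all. The queries I run either take a very long time and return no results, or they time out like the query below. For argument's sake, ...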
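
The table definition from the question is not included in the excerpt. For orientation only, the sketch below assembles the table properties that Athena partition projection uses for a date-typed partition column and renders them as a `TBLPROPERTIES` clause. The column name `dt`, the start date, the date format, and the S3 location template are all hypothetical values, not taken from the question.

```python
def projection_properties(column: str, start: str, date_format: str,
                          location_template: str) -> dict:
    """Build Athena partition projection properties for one date-typed column."""
    return {
        "projection.enabled": "true",
        f"projection.{column}.type": "date",
        f"projection.{column}.range": f"{start},NOW",
        f"projection.{column}.format": date_format,
        "storage.location.template": location_template,
    }


def to_tblproperties(props: dict) -> str:
    """Render the properties as a TBLPROPERTIES clause for a CREATE TABLE statement."""
    body = ",\n".join(f"  '{key}' = '{value}'" for key, value in props.items())
    return f"TBLPROPERTIES (\n{body}\n)"


if __name__ == "__main__":
    props = projection_properties(
        column="dt",                          # hypothetical partition column
        start="2022-01-01",                   # hypothetical first partition
        date_format="yyyy-MM-dd",             # must match how dates appear in S3
        location_template="s3://example-bucket/testparts/dt=${dt}/",  # hypothetical
    )
    print(to_tblproperties(props))
```

When projection is enabled, Athena computes partition locations from these properties rather than reading registered partitions from the catalog, so a range, format, or location template that does not line up with the actual S3 layout is a plausible cause of queries that return nothing.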

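Glue Job Succeeded but no data inserted into the target table (Aurora MySQL)

I created a Glue job using the visual tab as shown below. First I connected to a MySQL table as the data source, which is already in my Data Catalog. Then in the transform node I wrote a custom SQL query to select only one column from the source table. Validating with the data preview feature, the transform node works fine. Now I want to write the data to an existing database table that has only one column with " ...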

Copying files from one folder inside an AWS bucket to another if the folder name matches, using either AWS Glue or Lambda

I have 2 AWS buckets, staging and destination, both with the same number of subfolders, let's assume 3. So staging has 3 named a, b, c and destination has 3 named a, b, c. Now I want to copy files from the 3 subfolders a, b, c to a, b and c in the other bucket only if the names match, i.e. a to a, b to b, c to c, using ...
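
The matching rule described here can be sketched separately from any AWS calls. The function below takes a list of object keys from the staging bucket and the set of top-level folder names present in the destination bucket, and returns the keys that should be copied. The function name and the sample keys are hypothetical; listing the buckets and performing the copy are left out.

```python
def plan_copies(staging_keys: list, destination_folders: set) -> list:
    """Return staging keys whose top-level folder also exists in the destination."""
    selected = []
    for key in staging_keys:
        folder, separator, rest = key.partition("/")
        # Skip keys at the bucket root and bare folder markers such as "a/".
        if separator and rest and folder in destination_folders:
            selected.append(key)
    return selected


if __name__ == "__main__":
    staging = ["a/1.csv", "b/2.csv", "d/3.csv", "readme.txt", "c/"]  # sample keys
    destination = {"a", "b", "c"}
    for key in plan_copies(staging, destination):
        # Each key would keep the same name in the destination bucket (a to a, b to b).
        print(key)
```

In a Lambda function, each selected key could then be passed to the boto3 S3 client's `copy_object` with the staging bucket as `CopySource` and the same key in the destination bucket.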

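Cannot write Lake Formation governed table data from Glue ETL Job

I am building a POC with Lake Formation in which I read a queue of train movement information and persist the individual events into a governed table using AWS Data Wrangler. This works fine. I then try to read this governed table with an AWS Glue ETL job and write the resulting data to another governed table. This succeeds and writes Parquet files to S3 under that table ...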

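Environment for print capture on AWS Glue

For example, where can I see the prints written in my AWS Glue script? Something like a terminal screen that shows me the messages stored in the prints. I need to print the schema generated for my data output, to see whether it is what I need and to understand where my script breaks. ...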