
ETLing S3 data into CSV via Athena and/or Glue

I have an S3 bucket (com.example.myorg.images) full of image files, all of them following the same naming convention:

<PRODUCT_ID>_<NUMBER>.jpg

Where <PRODUCT_ID> is a long number (a primary key in an RDS table), and <NUMBER> is always one of three values: 100, 200 or 300. So for example, the bucket might contain:

  • 1394203949_100.jpg
  • 1394203949_200.jpg
  • 1394203949_300.jpg
  • 1394203950_100.jpg
  • 1394203950_200.jpg
  • 1394203950_300.jpg
  • ...etc.

I would like to write either an Athena or Glue ETL process that queries the S3 bucket for all the images inside of it, and somehow extracts the UNIQUE <PRODUCT_ID> values into a table or list.

It's my understanding that Athena will then write this table/list out as a downloadable CSV; if so, I will separately process that headerless CSV the way I need on the command line.

So for instance, if the 6 images above were the only images in the bucket, then this process would:

  1. Query S3 and obtain a table/list consisting of 1394203949 and 1394203950
  2. Create a downloadable CSV looking like this:

Could be a file on S3 or even in-memory:

1394203949,1394203950

Having no prior experience with either Athena or Glue, I'm attempting to accomplish this with an Athena query, but I'm having difficulty seeing the forest for the trees.

My best attempt at the 1st part (the S3 query):

CREATE EXTERNAL TABLE IF NOT EXISTS products_with_thumbnails (
  product_id string
) 
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
LOCATION 's3://com.example.myorg.images/';

Which I believe would set up my in-memory table with the file names of everything in the S3 bucket, but then:

  • How do I make this table contain only unique product IDs (no dupes)?
  • How do I extract out only the <PRODUCT_ID> segment of the filenames (1394203949 as opposed to 1394203949_100.jpg)?

I'm not partial to Athena or Glue, and would be happy with any solution that accomplishes what I need. Worst case, I could write a Lambda that accomplishes all of this ETL at the application layer, but if there is a Hive-like or ETL-oriented AWS service that exists for doing this kind of stuff anyway, I'd rather just leverage that!

Thanks in advance!

Athena queries the contents of files, not file listings, so using only Athena for this will not work (there are ways of abusing it to make it happen, but they would be expensive and slow, and not what you want).

If the number of images is less than a hundred thousand or so, I think your best bet is to just write a script that does more or less the equivalent of:

aws s3 ls --recursive s3://some-bucket/ | perl -ne '/(\d+)_\d+\.jpg$/ && print "$1\n"' | uniq
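
For reference, a rough sketch of what such a script could look like in Python with boto3 (the bucket name is the one from the question; the paginator ensures buckets with more than 1,000 objects are listed completely):

import re
import boto3

# Bucket name taken from the question; adjust as needed.
BUCKET = "com.example.myorg.images"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")  # handles more than 1,000 keys

product_ids = set()
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        # Keys look like "1394203949_100.jpg"; capture the leading product ID.
        match = re.fullmatch(r"(\d+)_\d+\.jpg", obj["Key"])
        if match:
            product_ids.add(match.group(1))

# One headerless, comma-separated line, as described in the question.
print(",".join(sorted(product_ids)))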

If it's more than that, I suggest using S3 Inventory and perhaps Athena for the processing. You can find instructions on how to enable S3 Inventory, and how to query the inventory with Athena, here: https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html

With an S3 Inventory set up, your query could look something like this:

SELECT DISTINCT regexp_extract(key, '(\d+)_\d+\.jpg', 1)
FROM the_inventory_table_name

It might be less work to write a script that processes the inventory than to set up Athena tables, though. Either way, I really recommend using S3 Inventory instead of listing S3 directly when there are many objects.
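
As a rough illustration of the script route, here is a sketch that reads one inventory delivery directly. It assumes the inventory was configured with CSV output (it can also be ORC or Parquet, as in the table definition below, which would need a different reader), and the bucket and manifest key names are hypothetical:

import csv
import gzip
import io
import json
import re
import boto3

# Hypothetical names; substitute your inventory destination bucket and the
# path to a specific delivery's manifest.json.
INVENTORY_BUCKET = "com.example.myorg.inventory"
MANIFEST_KEY = "com.example.myorg.images/images-inventory/2020-01-01T00-00Z/manifest.json"

s3 = boto3.client("s3")

# The manifest lists the data files that make up one inventory delivery.
manifest = json.loads(
    s3.get_object(Bucket=INVENTORY_BUCKET, Key=MANIFEST_KEY)["Body"].read()
)

product_ids = set()
for data_file in manifest["files"]:
    body = s3.get_object(Bucket=INVENTORY_BUCKET, Key=data_file["key"])["Body"].read()
    # CSV inventory files are gzip-compressed; column 0 is the bucket, column 1 the key.
    with gzip.open(io.BytesIO(body), mode="rt", newline="") as fh:
        for row in csv.reader(fh):
            match = re.fullmatch(r"(\d+)_\d+\.jpg", row[1])
            if match:
                product_ids.add(match.group(1))

print(",".join(sorted(product_ids)))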

It looks like you can create a table over your S3 Inventory data in S3, partitioned by date:

CREATE EXTERNAL TABLE my_inventory(
  `bucket` string,
  key string,
  version_id string,
  is_latest boolean,
  is_delete_marker boolean,
  size bigint,
  last_modified_date timestamp,
  e_tag string,
  storage_class string,
  is_multipart_uploaded boolean,
  replication_status string,
  encryption_status string,
  object_lock_retain_until_date timestamp,
  object_lock_mode string,
  object_lock_legal_hold_status string
  )
  PARTITIONED BY (dt string)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
  LOCATION 's3://com.example.myorg.mybucket/com.example.myorg.mybucket/com.example.myorg.mybucket-ORC/hive/';

Then, any time you want to query that my_inventory table, first repair the table so that new date partitions are picked up:

MSCK REPAIR TABLE my_inventory;

And finally you can query it via PrestoDB's SQL-like syntax:

SELECT key FROM my_inventory WHERE dt <= '<YYYY-MM-DD>-00-00';

Where <YYYY-MM-DD> is the current date in YYYY-MM-DD format.

You can then download the query results as a CSV file and process it however you like.
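
For example, a minimal sketch (assuming the results were saved locally as results.csv, one product ID per row) that reduces them to the single comma-separated line described in the question:

import csv

# "results.csv" is a hypothetical local copy of the downloaded query results:
# one product ID per row. Athena's result CSVs typically include a header row,
# so skip any row that isn't purely numeric.
with open("results.csv", newline="") as fh:
    ids = [row[0] for row in csv.reader(fh) if row and row[0].isdigit()]

print(",".join(ids))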
