ETLing S3 data into CSV via Athena and/or Glue
I have an S3 bucket (com.example.myorg.images) full of image files, all of them following the same naming convention:

<PRODUCT_ID>_<NUMBER>.jpg

Where <PRODUCT_ID> is a long number (a primary key in an RDS table), and <NUMBER> is always one of three values: 100, 200 or 300. So for example, the bucket might contain:

1394203949_100.jpg
1394203949_200.jpg
1394203949_300.jpg
1394203950_100.jpg
1394203950_200.jpg
1394203950_300.jpg
I would like to write either an Athena or Glue ETL process that queries the S3 bucket for all the images inside of it, and somehow extracts the UNIQUE <PRODUCT_ID> values into a table or list.
It's my understanding that Athena will then back up this table/list into a downloadable CSV; if true, then I will, separately, process that headerless CSV the way I need it on the command-line.
So for instance, if the 6 images above were the only images in the bucket, then this process would produce a table/list consisting of 1394203949 and 1394203950. Could be a file on S3 or even in-memory:

1394203949,1394203950
Having no prior experience with either Athena or Glue, I'm attempting to accomplish this with an Athena query, but I'm having difficulty seeing the forest through the trees. My best attempt at the 1st part (the S3 query):
CREATE EXTERNAL TABLE IF NOT EXISTS products_with_thumbnails (
product_id string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://com.example.myorg.images/';
Which would set up my in-memory table, I believe, with the file names of everything in the S3 bucket. But then: how would I query for only the unique/distinct <PRODUCT_ID> segment of the filenames (1394203949 as opposed to 1394203949_100.jpg)?

I'm not partial to Athena or Glue, and would be happy with any solution that accomplishes what I need. Worst case I could write a Lambda that accomplishes all of this ETL at the application layer, but if there is a Hive-like or ETL-oriented AWS service that exists for doing this kind of stuff anyways, I'd rather just leverage that!
Thanks in advance!
Athena queries inside of files, not file listings, so using only Athena for this will not work (there are ways of abusing it to make it happen, but they will be expensive and slow and not what you want).
If the number of images is less than a hundred thousand or so, I think your best bet is to just write a script that does more or less the equivalent of:

aws s3 ls --recursive s3://some-bucket/ | perl -ne '/(\d+)_\d+\.jpg$/ && print "$1\n"' | uniq
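The extraction step of that pipeline can also be sketched in Python (e.g. inside a Lambda); in practice the keys would come from boto3's list_objects_v2 paginator, but the core logic is just a regex over each key. `extract_product_ids` is a hypothetical helper name, shown here against a hard-coded key list so it runs standalone:

```python
import re

# Matches keys like '1394203949_100.jpg' and captures the <PRODUCT_ID> part.
KEY_PATTERN = re.compile(r"(\d+)_\d+\.jpg$")

def extract_product_ids(keys):
    """Return the sorted unique <PRODUCT_ID> values found in the given S3 keys."""
    ids = set()
    for key in keys:
        match = KEY_PATTERN.search(key)
        if match:
            ids.add(match.group(1))
    return sorted(ids)

keys = [
    "1394203949_100.jpg", "1394203949_200.jpg", "1394203949_300.jpg",
    "1394203950_100.jpg", "1394203950_200.jpg", "1394203950_300.jpg",
]
print(",".join(extract_product_ids(keys)))  # prints: 1394203949,1394203950
```

Unlike `uniq`, the set deduplicates regardless of ordering, so it doesn't depend on the listing being sorted.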
If it's more than that I suggest using S3 Inventory and perhaps Athena for the processing. You can find instructions on how to enable S3 Inventory, and query the inventory with Athena, here: https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html
With an S3 Inventory set up your query could look something like this:
SELECT DISTINCT regexp_extract(key, '(\d+)_\d+\.jpg', 1)
FROM the_inventory_table_name
It might be less work to write a script that processes the inventory than setting up Athena tables, though. Either way, I really recommend using S3 Inventory instead of listing S3 directly when there are many objects.
Looks like you can create a partitioned file of your S3 inventory, in S3, partitioned by date:
CREATE EXTERNAL TABLE my_inventory(
`bucket` string,
key string,
version_id string,
is_latest boolean,
is_delete_marker boolean,
size bigint,
last_modified_date timestamp,
e_tag string,
storage_class string,
is_multipart_uploaded boolean,
replication_status string,
encryption_status string,
object_lock_retain_until_date timestamp,
object_lock_mode string,
object_lock_legal_hold_status string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://com.example.myorg.mybucket/com.example.myorg.mybucket/com.example.myorg.mybucket-ORC/hive/';
Then, anytime you want to query that my_inventory table, first repair the partitioned file by creating a new partition for the current date:
MSCK REPAIR TABLE my_inventory;
And finally you can query it via PrestoDB's SQL-like syntax:
SELECT key FROM my_inventory WHERE dt <= '<YYYY-MM-DD>-00-00';
Where <YYYY-MM-DD> is the current date in YYYY-MM-DD format.
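If the querying is scripted, that dt literal can be built from the current date rather than typed by hand; a minimal sketch, assuming the YYYY-MM-DD-00-00 shape used above (`inventory_dt` is a hypothetical helper):

```python
from datetime import date

def inventory_dt(d):
    """Build the dt partition literal ('YYYY-MM-DD-00-00') used in the WHERE clause."""
    return d.strftime("%Y-%m-%d") + "-00-00"

print(inventory_dt(date(2020, 1, 15)))  # prints: 2020-01-15-00-00
```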
You can then download the query results as a CSV file and process it however you like.
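The command-line processing the question mentions can stay tiny. Assuming the downloaded result is a single-column CSV of product IDs (strip Athena's header row first if one is present), a sketch of collapsing it into the desired comma-joined line (`flatten_ids` is a hypothetical helper):

```python
import csv
import io

def flatten_ids(csv_text):
    """Collapse a single-column CSV of product IDs into one comma-joined line."""
    reader = csv.reader(io.StringIO(csv_text))
    return ",".join(row[0] for row in reader if row)

print(flatten_ids("1394203949\n1394203950\n"))  # prints: 1394203949,1394203950
```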