
How should I partition data in s3 for use with hadoop hive?

I have an S3 bucket containing about 300 GB of log files in no particular order.

I want to partition this data for use in hadoop-hive using a date-time stamp, so that log lines related to a particular day are clumped together in the same S3 'folder'. For example, log entries for January 1st would be in files matching the following naming:

s3://bucket1/partitions/created_date=2010-01-01/file1
s3://bucket1/partitions/created_date=2010-01-01/file2
s3://bucket1/partitions/created_date=2010-01-01/file3

etc.

What would be the best way for me to transform the data? Am I best off just running a single script that reads each file in turn and outputs the data to the right S3 location?

I'm sure there's a good way to do this using hadoop; could someone tell me what that is?

What I've tried:

I tried using hadoop-streaming by passing in a mapper that collected all the log entries for each date and then wrote them directly to S3, returning nothing to the reducer, but that seemed to create duplicates. (Using the above example, I ended up with 2.5 million entries for January 1st instead of 1.4 million.)
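
For reference, a stripped-down sketch of the kind of mapper I mean is below; the log format and the position of the timestamp are made up here, and the real mapper wrote the grouped lines straight to S3 instead of emitting key/value pairs like this:

#!/usr/bin/env python
# Hypothetical hadoop-streaming mapper: tag each log line with its created_date
# so that lines for the same day are grouped together by the framework.
# Assumes each line starts with an ISO timestamp like "2010-01-01T12:34:56".
import sys

for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    created_date = line[:10]  # the "YYYY-MM-DD" prefix of the timestamp
    sys.stdout.write("%s\t%s\n" % (created_date, line))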

Does anyone have any ideas on how best to approach this?

Why not create an external table over this data, then use hive to create the new table?

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
create table partitioned (some_field string, `timestamp` string) partitioned by (created_date date);
insert overwrite table partitioned partition (created_date) select some_field, `timestamp`, to_date(`timestamp`) from orig_external_table;

In fact, I haven't looked up the syntax, so you may need to correct it with reference to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries .
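
For completeness, here is a rough sketch of what the external table over the raw logs might look like; the column layout, the tab delimiter, and the s3://bucket1/rawlogs/ location are placeholders for whatever your real log files contain:

-- Placeholder schema and location; adjust to match the actual log format.
create external table orig_external_table (some_field string, `timestamp` string)
row format delimited fields terminated by '\t'
location 's3://bucket1/rawlogs/';

If you also create the partitioned table as an external table with location 's3://bucket1/partitions/', the insert above should leave the data laid out in the created_date=YYYY-MM-DD folders described in the question.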

If Hadoop has free slots in the task tracker, it will run multiple copies of the same task. If your output format doesn't properly ignore the resulting duplicate output keys and values (which is possibly the case for S3; I've never used it), you should turn off speculative execution. If your job is map-only, set mapred.map.tasks.speculative.execution to false. If you have a reducer, set mapred.reduce.tasks.speculative.execution to false. Check out Hadoop: The Definitive Guide for more information.
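
For a streaming job, one way to pass those properties is with the generic -D options on the command line; the jar path, the mapper.py script, and the S3 paths below are placeholders:

hadoop jar /path/to/hadoop-streaming.jar \
  -D mapred.map.tasks.speculative.execution=false \
  -D mapred.reduce.tasks.speculative.execution=false \
  -input s3://bucket1/rawlogs/ \
  -output s3://bucket1/partitions/ \
  -mapper mapper.py \
  -file mapper.py

For a map-only job you can also add -D mapred.reduce.tasks=0 to skip the reduce phase entirely.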
