
hive partition per one file

I don't want files to pile up too much. In the past I ran into an error because the number of HDFS files exceeded the limit, and I suspect that directories also count toward that maximum. So I want each partition of the table to point to a single file rather than a directory.

The directory-per-partition layout that I know:

/test/test.db/test_log/create_date=2013-04-09/2013-04-09.csv.gz
/test/test.db/test_log/create_date=2013-04-10/2013-04-10.csv.gz

I tried adding a partition like this. It works.

ALTER TABLE test_log ADD PARTITION (create_date='2013-04-09') LOCATION '/test/tmp/test_log/2013-04-09.csv.gz'

The file path per partition that I want:

/test/test.db/test_log/create_date=2013-04-09.csv.gz
/test/test.db/test_log/create_date=2013-04-10.csv.gz

I tried adding the partition like this:

ALTER TABLE test_log ADD PARTITION (create_date='2013-04-09') LOCATION '/test/tmp/test_log/2013-04-09.csv.gz'

It raised an error:

======================
HIVE FAILURE OUTPUT
======================
SET hive.support.sql11.reserved.keywords=false
SET hive.metastore.warehouse.dir=hdfs:/test/test.db
OK
OK
OK
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:hdfs://ABCDEFG/test/tmp/test_log/2013-04-09.csv.gz is not a directory or unable to create one)

======================
END HIVE FAILURE OUTPUT
======================

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/spark/python/pyspark/sql/context.py", line 580, in sql
    return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/local/spark/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o32.sql.
: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:hdfs://ABCDEFG/test/tmp/test_log/2013-04-09.csv.gz is not a directory or unable to create one)

The table schema is something like this:

CREATE TABLE IF NOT EXISTS test_log (
    testid INT, 
    create_dt STRING
) 
PARTITIONED BY (create_date STRING) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n' 
STORED AS TEXTFILE
  • I changed some paths in the commands for privacy, so typos may exist; please don't pay attention to the file names.

You should only specify the path up to the folder when you are creating/altering the location of a Hive table or partition; Hive treats the LOCATION as a directory and reads every file inside it.

ALTER TABLE test_log ADD PARTITION (create_date='2013-04-09') LOCATION '/test/tmp/test_log/create_date=2013-04-09/'
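
The traceback in the question shows the statement being run through sqlContext.sql in PySpark (Spark 1.x). As a minimal sketch, assuming that same setup and reusing the example paths from the question, the corrected statement would look like this:

# Minimal sketch, assuming the Spark 1.x + HiveContext setup visible in the
# traceback above; the table name and paths are the examples from the question.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="add_partition_example")
sqlContext = HiveContext(sc)

# Point the partition LOCATION at a directory, not at the .gz file itself.
sqlContext.sql(
    "ALTER TABLE test_log ADD PARTITION (create_date='2013-04-09') "
    "LOCATION '/test/tmp/test_log/create_date=2013-04-09/'"
)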

Then put the file in that location:

hadoop fs -put 2013-04-09.csv.gz /test/tmp/test_log/create_date=2013-04-09/
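
Once the file is in the partition directory, a quick check that Hive can read it (a sketch continuing the sqlContext above; gzip-compressed text files are decompressed transparently for a TEXTFILE table):

# Count the rows behind the new partition to confirm the file is picked up.
sqlContext.sql(
    "SELECT COUNT(*) FROM test_log WHERE create_date = '2013-04-09'"
).show()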
