
hive partition per one file

I don't want files to pile up too much. In the past I ran into an error because the number of HDFS files exceeded the limit, and I suspect that directories also count toward that maximum. So I want each partition of the table to point to a single file rather than a directory.

The directory-per-partition layout that I know:

/test/test.db/test_log/create_date=2013-04-09/2013-04-09.csv.gz
/test/test.db/test_log/create_date=2013-04-10/2013-04-10.csv.gz

I tried adding a partition like this. It works.

ALTER TABLE test_log ADD PARTITION (create_date='2013-04-09') LOCATION '/test/tmp/test_log/2013-04-09.csv.gz'

The file path per partition that I want:

/test/test.db/test_log/create_date=2013-04-09.csv.gz
/test/test.db/test_log/create_date=2013-04-10.csv.gz

I tried adding the partition like this:

ALTER TABLE test_log ADD PARTITION (create_date='2013-04-09') LOCATION '/test/tmp/test_log/2013-04-09.csv.gz'

It raised an error:

======================
HIVE FAILURE OUTPUT
======================
SET hive.support.sql11.reserved.keywords=false
SET hive.metastore.warehouse.dir=hdfs:/test/test.db
OK
OK
OK
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:hdfs://ABCDEFG/test/tmp/test_log/2013-04-09.csv.gz is not a directory or unable to create one)

======================
END HIVE FAILURE OUTPUT
======================

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/spark/python/pyspark/sql/context.py", line 580, in sql
    return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/local/spark/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o32.sql.
: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:hdfs://ABCDEFG/test/tmp/test_log/2013-04-09.csv.gz is not a directory or unable to create one)

The table schema is something like this:

CREATE TABLE IF NOT EXISTS test_log (
    testid INT, 
    create_dt STRING
) 
PARTITIONED BY (create_date STRING) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n' 
STORED AS TEXTFILE
  • I changed some paths in the commands for privacy, so typos may exist; please don't pay attention to the file names.

You should only specify the path up to the folder when you are creating/altering the location of a Hive table or partition; Hive treats the LOCATION as a directory and reads every file inside it.

ALTER TABLE test_log ADD PARTITION (create_date='2013-04-09') LOCATION '/test/tmp/test_log/create_date=2013-04-09/'
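
The traceback in the question shows the statement being run through sqlContext.sql in PySpark (Spark 1.x). As a minimal sketch, assuming that same setup and reusing the example paths from the question, the corrected statement would look like this:

# Minimal sketch, assuming the Spark 1.x + HiveContext setup visible in the
# traceback above; the table name and paths are the examples from the question.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="add_partition_example")
sqlContext = HiveContext(sc)

# Point the partition LOCATION at a directory, not at the .gz file itself.
sqlContext.sql(
    "ALTER TABLE test_log ADD PARTITION (create_date='2013-04-09') "
    "LOCATION '/test/tmp/test_log/create_date=2013-04-09/'"
)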

Then put the file in that location:

hadoop fs -put 2013-04-09.csv.gz /test/tmp/test_log/create_date=2013-04-09/
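
Once the file is in the partition directory, a quick check that Hive can read it (a sketch continuing the sqlContext above; gzip-compressed text files are decompressed transparently for a TEXTFILE table):

# Count the rows behind the new partition to confirm the file is picked up.
sqlContext.sql(
    "SELECT COUNT(*) FROM test_log WHERE create_date = '2013-04-09'"
).show()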
