Accessing S3 CSV files in Amazon Athena

Question

I am trying to load files from S3 into Athena so that I can query them, but all the column values end up in the first column.

I have a file in the following format:

id,user_id,personal_id,created_at,updated_at,active
34,34,43,31:28.4,27:07.9,TRUE

This is the output I get:

Table creation query:

    CREATE EXTERNAL TABLE `testing`(
      `id` string, 
      `user_id` string, 
      `personal_id` string, 
      `created_at` string, 
      `updated_at` string, 
      `active` string)
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      's3://testing2fa/'
    TBLPROPERTIES (
      'transient_lastDdlTime'='1665356861')

Can someone please tell me where I am going wrong?

Answer 1

You should add skip.header.line.count to your table properties to skip the header row. Because every column is defined as string, Athena cannot tell the header apart from a data row and returns it as the first row of results.

DDL with the property added:

    CREATE EXTERNAL TABLE `testing`(
      `id` string, 
      `user_id` string, 
      `personal_id` string, 
      `created_at` string, 
      `updated_at` string, 
      `active` string)
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      's3://testing2fa/'
    TBLPROPERTIES ('skip.header.line.count'='1')

Answer 2

The SerDe needs some parameters to recognize CSV files, such as:

    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      ESCAPED BY '\\'
      LINES TERMINATED BY '\n'

See: LazySimpleSerDe for CSV, TSV, and custom-delimited files - Amazon Athena
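
For context, here is a sketch of how that fragment might fit into the full statement from the question. It assumes the same table name, columns, and S3 location as the original DDL, and it also carries over the skip.header.line.count property suggested in Answer 1; treat it as an illustration rather than a verified fix for this particular bucket.

```sql
-- Sketch: the question's table with an explicit comma delimiter.
-- Table name, columns, and location are copied from the question.
CREATE EXTERNAL TABLE `testing`(
  `id` string,
  `user_id` string,
  `personal_id` string,
  `created_at` string,
  `updated_at` string,
  `active` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
LOCATION 's3://testing2fa/'
TBLPROPERTIES ('skip.header.line.count'='1')
```

If a table named testing already exists, it would have to be dropped first. For an external table, DROP TABLE removes only the table metadata and leaves the files in S3 untouched.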

An alternative method is to use AWS Glue to create the tables for you. In the AWS Glue console, you can create a crawler and point it at your data. When you run the crawler, it automatically creates a table definition in the AWS Glue Data Catalog that matches the supplied data files, and Athena can then query that table.
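
Once a crawler has run, one way to check what it produced is to inspect the generated definition and sample a few rows from Athena. The table name testing2fa below is a hypothetical placeholder; the actual name is whatever the crawler assigned. Run each statement as a separate query.

```sql
-- Hypothetical table name: replace testing2fa with the name the crawler created.
-- Show the DDL the crawler generated, including SerDe and delimiter settings.
SHOW CREATE TABLE testing2fa

-- Sample a few rows to confirm the values land in separate columns.
SELECT * FROM testing2fa LIMIT 10
```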
