
Failing to load a CSV with JSON string from S3 to Redshift

I have a CSV with rows that look like this:

2021-08-20,2021-10-04,2021-10-04,148355456455712,Accountname,USD,"[{'action_type': 'add_to_cart', 'value': '266.63', '1d_click': '266.63', '7d_click': '266.63'}, {'action_type': 'initiate_checkout', 'value': '213.03', '1d_click': '213.03', '7d_click': '213.03'}, {'action_type': 'view_content', 'value': '762.75', '1d_click': '762.75', '7d_click': '762.75'}, {'action_type': 'omni_add_to_cart', 'value': '266.63', '1d_click': '266.63', '7d_click': '266.63'}, {'action_type': 'omni_initiated_checkout', 'value': '213.03', '1d_click': '213.03', '7d_click': '213.03'}, {'action_type': 'omni_view_content', 'value': '762.75', '1d_click': '762.75', '7d_click': '762.75'}, {'action_type': 'add_to_cart', 'value': '266.63', '1d_click': '266.63', '7d_click': '266.63'}, {'action_type': 'initiate_checkout', 'value': '213.03', '1d_click': '213.03', '7d_click': '213.03'}]"

I am trying to load this CSV into a Redshift table with the following schema:

Columns             Type    Nullable    Length  Precision

date_start          varchar true        256     256
date_stop           varchar true        256     256
created_time        varchar true        256     256
account_id          int8    true        19      19
account_name        varchar true        256     256
account_currency    varchar true        256     256
action_values       varchar true        256     256
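
For reference, a sketch of the DDL this schema implies (the table name is the placeholder used in the COPY statement below):

create table table_name (
    date_start       varchar(256),
    date_stop        varchar(256),
    created_time     varchar(256),
    account_id       int8,
    account_name     varchar(256),
    account_currency varchar(256),
    action_values    varchar(256)  -- note: the sample JSON string above is longer than 256 characters
);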

I'm using the following COPY statement:

copy table_name
from 's3://bucket_name/subdirectory/filename.csv'
delimiter ','
ignoreheader 1
csv quote as '"'
dateformat 'auto'
timeformat 'auto'
access_key_id '...'
secret_access_key '...';

and I get this error: Load into table 'table_name' failed. Check 'stl_load_errors' system table for details.
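
For reference, one way to pull those details out of the system table (the columns are standard stl_load_errors columns; the ordering and limit are just for illustration):

select query, line_number, colname, raw_field_value, err_reason
from stl_load_errors
order by starttime desc
limit 10;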

When I look at the stl_load_errors table, this is what I see:

query   substring   line    value           err_reason

93558   ...         2   2021-08-20          Invalid digit, Value '[', Pos 0, Type: Long
93558   ...         2   2021-10-04          Invalid digit, Value '[', Pos 0, Type: Long
93558   ...         2   2021-10-04          Invalid digit, Value '[', Pos 0, Type: Long
93558   ...         2   148355456455712     Invalid digit, Value '[', Pos 0, Type: Long
93558   ...         2   Accountname         Invalid digit, Value '[', Pos 0, Type: Long
93558   ...         2   USD                 Invalid digit, Value '[', Pos 0, Type: Long

I just can't figure out why it isn't working, but I guess it has something to do with the JSON string. Also, I can't understand where this "Type: Long" is coming from.

I am trying to avoid using JSON files as input...

Can anyone help?

Your data does look reasonable (as far as I can tell), except that the last field (the JSON data) is longer than 256 characters. However, this is not the error you are showing. The format of stl_load_errors isn't the format the table is in, so I assume you are doing some processing on this table in what you are showing in your question. The "Type: Long" is referring to the INT8 in your table DDL - INT8 is also known as a Big Int or a Long Int.

I think your issue is that you haven't specified that COPY is reading a CSV file, and the default format is DELIMITED. For Redshift COPY to follow CSV rules you need to specify that the file format is CSV. Specifically, I suspect that the double quotes are not providing the data value grouping you are expecting.
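
In COPY terms that means including the CSV keyword, e.g. (a sketch reusing the placeholder names from the question; with CSV, the comma delimiter and the double-quote quote character are already the defaults):

copy table_name
from 's3://bucket_name/subdirectory/filename.csv'
csv
ignoreheader 1
dateformat 'auto'
timeformat 'auto'
access_key_id '...'
secret_access_key '...';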

I worked it out!

Since Redshift parses every ' as a string delimiter and every , as a field delimiter when not within a string, the solution is to replace every ' with " within the JSON string, so a row would look the following way:

2021-08-20,2021-10-04,2021-10-04,148355456455712,Accountname,USD,'[{"action_type": "add_to_cart", "value": "266.63", "1d_click": "266.63", "7d_click": "266.63"}, {"action_type": "initiate_checkout", "value": "213.03", "1d_click": "213.03", "7d_click": "213.03"}, {"action_type": "view_content", "value": "762.75", "1d_click": "762.75", "7d_click": "762.75"}, {"action_type": "omni_add_to_cart", "value": "266.63", "1d_click": "266.63", "7d_click": "266.63"}, {"action_type": "omni_initiated_checkout", "value": "213.03", "1d_click": "213.03", "7d_click": "213.03"}, {"action_type": "omni_view_content", "value": "762.75", "1d_click": "762.75", "7d_click": "762.75"}, {"action_type": "add_to_cart", "value": "266.63", "1d_click": "266.63", "7d_click": "266.63"}, {"action_type": "initiate_checkout", "value": "213.03", "1d_click": "213.03", "7d_click": "213.03"}]'
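
The matching COPY isn't shown above; presumably the quote character would also need to become the single quote for the commas inside the JSON to stay grouped. A hedged sketch, where the QUOTE AS value is an assumption rather than something from the original post:

copy table_name
from 's3://bucket_name/subdirectory/filename.csv'
csv quote as ''''
ignoreheader 1
access_key_id '...'
secret_access_key '...';

Here '''' is a single quote, escaped SQL-style by doubling it inside the string literal.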

Since I used pandas, the preprocessing code was the following:

# df and array_cols are defined earlier in the preprocessing (not shown);
# array_cols lists the columns holding the JSON-like strings
for col in array_cols:
    df[col] = df[col].str.replace('\'', '\"')
