
Loading unstructured CSV data into Hive

I would like to load a CSV file that contains 250,000 posts from Stack Exchange into Hive. The CSV has the following columns:

    Id  Score   ViewCount   ParentId    Body    DisplayName rnk

Every field is delimited by a ",", but the field that breaks everything is Body.

Body holds the contents of the top 250,000 posts on the website, one post per row, so it contains all sorts of characters.

I've read up on SerDes and regexps, but I am still getting null values in my Hive table.

    CREATE TABLE dataStore(Id string, Score string, ViewCount string,     ParentId string, Body String, DisplayName String, Rank String)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar"     = """",
    "escapeChar"    = "\"
    )  
    STORED AS TEXTFILE;

I normally use ogrodnek's serde; you might have more luck with that. Also, I don't think you're escaping your special characters properly. I believe you need:

"quoteChar"     = "\"",
"escapeChar"    = "\\"
