
Hadoop/Hive: Loading data from .csv on a local machine

As this is coming from a newbie...

I had Hadoop and Hive set up for me, so I can run Hive queries on my computer, accessing data on an AWS cluster. Can I run Hive queries against .csv data stored on my computer, as I did with MS SQL Server?

How do I load .csv data into Hive, then? What does it have to do with Hadoop, and which mode should I run it in?

What settings should I care about, so that if I do something wrong I can always go back and run queries on Amazon without compromising what was set up for me earlier?

Let me walk you through the following simple steps:

Steps:

First, create a table in Hive using the field names in your CSV file. Say, for example, your CSV file contains three fields (id, name, salary) and you want to create a table in Hive called "staff". Use the code below to create the table:

hive> CREATE TABLE Staff (id int, name string, salary double) row format delimited fields terminated by ',';

Second, now that your table is created in Hive, load the data from your CSV file into the "staff" table:

hive>  LOAD DATA LOCAL INPATH '/home/yourcsvfile.csv' OVERWRITE INTO TABLE Staff;

Lastly, display the contents of the "Staff" table to check whether the data was loaded successfully:

hive> SELECT * FROM Staff;
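To try these steps end to end, a small sample file matching the three-field (id, name, salary) schema can be generated locally. This is only an illustrative sketch; the filename and rows are made up, not part of the original answer:

```python
import csv

# Write a small CSV matching the Staff schema (id, name, salary).
# The filename and rows are illustrative only.
rows = [(1, "Alice", 50000.0), (2, "Bob", 45000.0)]
with open("yourcsvfile.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Read it back to confirm the comma-delimited layout Hive expects
# with FIELDS TERMINATED BY ','.
with open("yourcsvfile.csv", newline="") as f:
    parsed = [tuple(r) for r in csv.reader(f)]
print(parsed)  # [('1', 'Alice', '50000.0'), ('2', 'Bob', '45000.0')]
```

A file laid out like this can then be loaded with the `LOAD DATA LOCAL INPATH` statement above.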

Thanks.

If you have a Hive setup, you can put the local dataset directly into HDFS/S3 using the Hive LOAD command.

You will need to use the "LOCAL" keyword when writing your LOAD command.

Syntax for the Hive LOAD command:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

Refer to the link below for more detailed information: https://cwiki.apache.org/confluence/display/Hive/LanguageManual%20DML#LanguageManualDML-Loadingfilesintotables

There is another way of enabling this:

  1. Use `hadoop fs -copyFromLocal` to copy the .csv data file from your local computer to somewhere in HDFS, say '/path/filename'

  2. Enter the Hive console and run the following script to load from the file and make it a Hive table. Note that '\054' is the ASCII code of the comma in octal, representing the field delimiter.


CREATE EXTERNAL TABLE tablename (foo INT, bar STRING)
 COMMENT 'from csv file'
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
 STORED AS TEXTFILE
 LOCATION '/path/filename';
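The octal notation above is easy to verify: octal 054 is decimal 44, the ASCII code of the comma. A quick Python check (illustrative only, not part of the original answer):

```python
# '\054' is an octal escape: 0o54 == 44, the ASCII code of ','.
delimiter = "\054"
print(delimiter == ",")  # True
print(chr(0o54))         # ','
print(ord(","))          # 44
```

So `FIELDS TERMINATED BY '\054'` is equivalent to `FIELDS TERMINATED BY ','`.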

For a CSV file, the data will be in the format below:

"column1", "column2","column3","column4"

If we use fields terminated by ',', then each column will get values like below, with the quotes retained:

"column1"    "column2"     "column3"     "column4"

Also, if any column value contains a comma, it will not work at all.

So the correct way to create the table is to use OpenCSVSerde:

CREATE TABLE tableName (column1 datatype, column2 datatype, column3 datatype, column4 datatype)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;
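The difference between a plain comma delimiter and a quote-aware parser can be reproduced outside Hive. The Python sketch below (illustrative only) shows how a naive split on ',' — roughly what a plain `FIELDS TERMINATED BY ','` does — breaks on a quoted comma, while a quote-aware CSV parser keeps the field intact:

```python
import csv
import io

line = '"Smith, John","42","NYC"'

# Naive split on ',' -- the quoted comma splits one field into two,
# and the surrounding quotes are retained:
naive = line.split(",")
print(naive)   # ['"Smith', ' John"', '"42"', '"NYC"']

# A quote-aware parser keeps "Smith, John" as one field and strips quotes:
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)  # ['Smith, John', '42', 'NYC']
```

This is the behavior OpenCSVSerde gives you for quoted CSV input.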

You may try this. Following are a few examples of how the files are generated. Tool: https://sourceforge.net/projects/csvtohive/?source=directory

  1. Select a CSV file using Browse and set the Hadoop root directory, e.g.: /user/bigdataproject/

  2. The tool generates a Hadoop script for all CSV files; the following is a sample of a generated Hadoop script to insert the CSVs into Hadoop:

     #!/bin/bash -v
    hadoop fs -put ./AllstarFull.csv /user/bigdataproject/AllstarFull.csv
    hive -f ./AllstarFull.hive

    hadoop fs -put ./Appearances.csv /user/bigdataproject/Appearances.csv
    hive -f ./Appearances.hive

    hadoop fs -put ./AwardsManagers.csv /user/bigdataproject/AwardsManagers.csv
    hive -f ./AwardsManagers.hive

  3. Sample of the generated Hive scripts:

     CREATE DATABASE IF NOT EXISTS lahman; 
    USE lahman;
    CREATE TABLE AllstarFull (playerID string,yearID string,gameNum string,gameID string,teamID string,lgID string,GP string,startingPos string) row format delimited fields terminated by ',' stored as textfile;
    LOAD DATA INPATH '/user/bigdataproject/AllstarFull.csv' OVERWRITE INTO TABLE AllstarFull;
    SELECT * FROM AllstarFull;
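The generator idea can be approximated in a few lines. This is only a sketch of what such a tool might do, using the sample filenames from the answer; the real csvtohive tool may work differently:

```python
# Sketch: emit paired "hadoop fs -put" / "hive -f" lines for a list of
# CSV files, mimicking the generated Hadoop script shown above.
def make_hadoop_script(csv_files, hdfs_dir="/user/bigdataproject"):
    lines = ["#!/bin/bash -v"]
    for name in csv_files:
        base = name.rsplit(".", 1)[0]
        lines.append(f"hadoop fs -put ./{name} {hdfs_dir}/{name}")
        lines.append(f"hive -f ./{base}.hive")
    return "\n".join(lines)

script = make_hadoop_script(["AllstarFull.csv", "Appearances.csv"])
print(script)
```

Each CSV gets one upload command and one `hive -f` invocation of its matching `.hive` script, just like the generated output above.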

Thanks, Vijay

You can load a local CSV file into Hive only if:

  1. You are doing it from one of the Hive cluster nodes.
  2. You installed the Hive client on a non-cluster node and are using hive or beeline for the upload.
