How to Load data from CSV into separate Hadoop HDFS directories based on fields
I have a CSV of data and I need to load it into HDFS directories based on a certain field (year). I am planning to use Java. I have looked at using BufferedReader, but I am having trouble implementing it. Would BufferedReader be the optimal choice for this task, or is there a better way?
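For reference, a plain-Java BufferedReader approach is workable. The sketch below is one way to do it, under some assumptions not stated in the question: the CSV has a header row containing a `year` column, fields contain no embedded commas or quotes, and the output goes to local directories named `year=<value>` (for real HDFS you would swap `java.nio.file` for the Hadoop `FileSystem` API):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SplitCsvByYear {
    // Splits a CSV (header + rows) into one file per distinct year value.
    // Assumes simple comma-separated fields (no quoting) and a "year" header column.
    public static void splitByYear(Path input, Path outputRoot) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            String header = reader.readLine();
            if (header == null) {
                return; // empty file, nothing to split
            }
            List<String> columns = Arrays.asList(header.split(","));
            int yearIdx = columns.indexOf("year");
            if (yearIdx < 0) {
                throw new IllegalArgumentException("no 'year' column in header");
            }

            // One open writer per distinct year value seen so far.
            Map<String, BufferedWriter> writers = new HashMap<>();
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split(",", -1);
                    String year = fields[yearIdx];
                    BufferedWriter w = writers.get(year);
                    if (w == null) {
                        // First row for this year: create year=<value>/part-0.csv
                        Path dir = outputRoot.resolve("year=" + year);
                        Files.createDirectories(dir);
                        w = Files.newBufferedWriter(dir.resolve("part-0.csv"));
                        w.write(header);
                        w.newLine();
                        writers.put(year, w);
                    }
                    w.write(line);
                    w.newLine();
                }
            } finally {
                for (BufferedWriter w : writers.values()) {
                    w.close();
                }
            }
        }
    }
}
```

Note that this keeps one writer open per distinct year, which is fine for a handful of years but would need a cap or an external sort if the partition column had very high cardinality.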
Use Spark to read the CSV into a DataFrame, then call partitionBy("year") when writing to HDFS. Spark will create a sub-folder named year=<value> under the output path for each unique value of that column.
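A minimal sketch of this approach in Java, assuming Spark is available on the classpath; the input and output paths are hypothetical placeholders:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class PartitionCsvByYear {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("partition-csv-by-year")
                .getOrCreate();

        // Read the CSV with its header row so the "year" column is named.
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .csv("hdfs:///data/input.csv"); // hypothetical input path

        // partitionBy("year") creates year=<value>/ sub-directories
        // under the output path, one per distinct value.
        df.write()
                .partitionBy("year")
                .mode(SaveMode.Overwrite)
                .option("header", "true")
                .csv("hdfs:///data/output"); // hypothetical output path

        spark.stop();
    }
}
```

One caveat of this layout: the `year` column itself is moved into the directory name and dropped from the data files, which is the standard Hive-style partitioning convention that Spark reads back transparently.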