
How to read first line in Hadoop (HDFS) file efficiently using Java?

I have a large CSV file on my Hadoop cluster. The first line of the file is a 'header' line consisting of field names. I want to do an operation on this header line, but I do not want to process the whole file. Also, my program is written in Java and uses Spark.

What is an efficient way to read just the first line of a large CSV file on a Hadoop cluster?

You can access HDFS files with the FileSystem class and friends:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// FileSystem.get returns the right implementation (here HDFS) for the URI
FileSystem fileSystem = FileSystem.get(new URI("hdfs://namenode-host:54310"), conf);
// try-with-resources ensures the stream and reader are closed
try (FSDataInputStream input = fileSystem.open(new Path("/path/to/file.csv"));
     BufferedReader reader = new BufferedReader(
             new InputStreamReader(input, StandardCharsets.UTF_8))) {
    // readLine() stops at the first newline, so the rest of the
    // file is never pulled across the network
    System.out.println(reader.readLine());
}

This code won't use MapReduce and will run with a reasonable speed.
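Once you have the header line, extracting the field names is a small step. A minimal sketch, assuming a simple comma-delimited header with no quoted fields (use a proper CSV library if your fields can contain commas); the `HeaderParser` class name is illustrative, not from the original:

```java
import java.util.Arrays;
import java.util.List;

public class HeaderParser {
    // Split a simple CSV header line into trimmed field names.
    // Assumption: no quoted fields containing embedded commas.
    static List<String> fieldNames(String headerLine) {
        String[] parts = headerLine.split(",");
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return Arrays.asList(parts);
    }

    public static void main(String[] args) {
        System.out.println(fieldNames("id, name, price")); // prints [id, name, price]
    }
}
```

You would pass the string returned by `reader.readLine()` above into `fieldNames` instead of a literal.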
