
Reading Data From FTP Server in Hadoop/Cascading

I want to read data from an FTP server. I provide the path of the file residing on the FTP server in the format ftp://Username:Password@host/path. When I use a plain MapReduce program to read data from the file, it works fine (a minimal sketch of such a job follows the stack trace below). However, when I read the same file through the Cascading framework, using Cascading's Hfs tap, it throws the following exception:

java.io.IOException: Stream closed
    at org.apache.hadoop.fs.ftp.FTPInputStream.close(FTPInputStream.java:98)
    at java.io.FilterInputStream.close(Unknown Source)
    at org.apache.hadoop.util.LineReader.close(LineReader.java:83)
    at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:168)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:254)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:440)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
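
For reference, here is a minimal sketch of the kind of plain MapReduce job that reads the same ftp:// path successfully. The class name and output path are placeholders, and the host and credentials are the same placeholders as above; the identity mapper simply passes each line through:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class FtpReadDemo { // hypothetical class name
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(FtpReadDemo.class);
        conf.setJobName("ftp-read-demo");
        // Identity mapper: each input line is passed through unchanged.
        conf.setMapperClass(IdentityMapper.class);
        conf.setNumReduceTasks(0); // map-only job
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        // Same ftp:// URI format as above; credentials and host are placeholders.
        FileInputFormat.setInputPaths(conf, new Path("ftp://user:pwd@xx.xx.xx.xx/input1"));
        FileOutputFormat.setOutputPath(conf, new Path("ftp-read-output"));
        JobClient.runJob(conf);
    }
}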

Below is the Cascading code from which I am reading the file:

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.*;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.*;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class FTPWithHadoopDemo {
    public static void main(String args[]) {
        Tap source = new Hfs(new TextLine(new Fields("line")), "ftp://user:pwd@xx.xx.xx.xx//input1");
        Tap sink = new Hfs(new TextLine(new Fields("line1")), "OP\\op", SinkMode.REPLACE);
        Pipe pipe = new Pipe("First");
        pipe = new Each(pipe, new RegexSplitGenerator("\\s+")); // split each line into words
        pipe = new GroupBy(pipe);
        Pipe tailpipe = new Every(pipe, new Count()); // count occurrences per word
        FlowDef flowDef = FlowDef.flowDef().addSource(pipe, source).addTailSink(tailpipe, sink);
        new HadoopFlowConnector().connect(flowDef).complete();
    }
}

I tried to look in the Hadoop source code for the same exception. I found that in the MapTask class there is a method, runOldMapper, which deals with the stream. In the same method there is a finally block where the stream gets closed (in.close()). When I remove that line from the finally block, it works fine. Below is the code:

private <INKEY, INVALUE, OUTKEY, OUTVALUE> void runOldMapper(final JobConf job, final TaskSplitIndex splitIndex,
            final TaskUmbilicalProtocol umbilical, TaskReporter reporter)
                    throws IOException, InterruptedException, ClassNotFoundException {
        InputSplit inputSplit = getSplitDetails(new Path(splitIndex.getSplitLocation()), splitIndex.getStartOffset());

        updateJobWithSplit(job, inputSplit);
        reporter.setInputSplit(inputSplit);

        RecordReader<INKEY, INVALUE> in = isSkipping()
                ? new SkippingRecordReader<INKEY, INVALUE>(inputSplit, umbilical, reporter)
                : new TrackedRecordReader<INKEY, INVALUE>(inputSplit, job, reporter);
        job.setBoolean("mapred.skip.on", isSkipping());

        int numReduceTasks = conf.getNumReduceTasks();
        LOG.info("numReduceTasks: " + numReduceTasks);
        MapOutputCollector collector = null;
        if (numReduceTasks > 0) {
            collector = new MapOutputBuffer(umbilical, job, reporter);
        } else {
            collector = new DirectMapOutputCollector(umbilical, job, reporter);
        }
        MapRunnable<INKEY, INVALUE, OUTKEY, OUTVALUE> runner = ReflectionUtils.newInstance(job.getMapRunnerClass(),
                job);

        try {
            runner.run(in, new OldOutputCollector(collector, conf), reporter);
            collector.flush();
        } finally {
            // close
            in.close(); // close input
            collector.close();
        }
    }
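
What seems to matter here is that FTPInputStream.close() is not idempotent: by the time this finally block runs, the record reader has already closed the stream (see LineRecordReader.close() in the stack trace), so it is the second close() that throws. A rough sketch of the guard, approximating the Hadoop 1.x sources rather than quoting them:

// Approximation of org.apache.hadoop.fs.ftp.FTPInputStream.close();
// not a verbatim copy of the Hadoop sources.
public synchronized void close() throws IOException {
    if (closed) {
        // A second close() lands here -- hence "Stream closed" in the trace.
        throw new IOException("Stream closed");
    }
    super.close();
    closed = true;
    // The real implementation also completes the pending FTP transfer
    // and disconnects the underlying FTP client here.
}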

Please assist me in solving this problem.

Thanks, Arshadali

After some effort I found out that Hadoop uses the org.apache.hadoop.fs.ftp.FTPFileSystem class for FTP.
This class doesn't support seek, i.e. seeking to a given offset from the start of the file. Data is read in one block, and then the file system seeks to the next block to read. The default block size for FTPFileSystem is 4 KB. As seek is not supported, it can only read data less than or equal to 4 KB.
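
The limitation is visible in the stream implementation itself: FTPInputStream rejects seeks outright. A rough sketch, approximating the Hadoop 1.x sources rather than quoting them:

// Approximation of the seek methods in
// org.apache.hadoop.fs.ftp.FTPInputStream; not a verbatim copy.
public void seek(long pos) throws IOException {
    throw new IOException("Seek not supported");
}

public boolean seekToNewSource(long targetPos) throws IOException {
    throw new IOException("Seek not supported");
}

Because every input split after the first one starts at a non-zero offset, the record reader for such a split has to seek before it can read, which fails; only the first 4 KB split is readable.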
