
Hadoop mapreduce with input size ~ 2Mb slow

I tried to distribute a calculation using Hadoop.

I am using SequenceFile input and output, and custom Writables.

The input is a list of triangles, maximum size about 2 MB, but it can also be smaller, around 50 kB. The intermediate values and the output are a map(int, double) held in the custom Writable. Is this the bottleneck?

The problem is that the calculation is much slower than the version without Hadoop. Also, increasing the number of nodes from 2 to 10 doesn't speed up the process.

One possibility is that I don't get enough mappers because of the small input size. I ran tests changing mapreduce.input.fileinputformat.split.maxsize, but it only got worse, not better.
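For reference, a minimal sketch of how that property is usually set in the job driver; the 512 kB value is an arbitrary example, not a recommendation:

    // Cap the split size so that even a ~2 MB input is cut into several splits.
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 512L * 1024);
    // Equivalently, via the helper on the input format base class:
    FileInputFormat.setMaxInputSplitSize(job, 512L * 1024);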

I am using Hadoop 2.2.0, both locally and on Amazon Elastic MapReduce.

Did I overlook something? Or is this simply the kind of task that should be done without Hadoop? (It's my first time using MapReduce.)

Would you like to see parts of the code?

Thank you.

@Override
public void map(IntWritable triangleIndex, TriangleWritable triangle, Context context)
        throws IOException, InterruptedException {
    // Run the per-triangle computation; it may produce several stations or none.
    StationWritable[] stations = kernel.newton(triangle.getPoints());
    if (stations != null) {
        for (StationWritable station : stations) {
            // Emit one intermediate record per station, keyed by station id.
            context.write(new IntWritable(station.getId()), station);
        }
    }
}


class TriangleWritable implements Writable {

    // Fixed-size payload: 9 floats per triangle.
    private final float[] points = new float[9];

    @Override
    public void write(DataOutput d) throws IOException {
        for (int i = 0; i < 9; i++) {
            d.writeFloat(points[i]);
        }
    }

    @Override
    public void readFields(DataInput di) throws IOException {
        for (int i = 0; i < 9; i++) {
            points[i] = di.readFloat();
        }
    }
}

public class StationWritable implements Writable {

    private int id;
    private final TIntDoubleHashMap values = new TIntDoubleHashMap();

    // Hadoop deserializes Writables via reflection, so a no-argument
    // constructor is required in addition to the convenience constructor.
    public StationWritable() {
    }

    StationWritable(int iz) {
        this.id = iz;
    }

    @Override
    public void write(DataOutput d) throws IOException {
        d.writeInt(id);
        d.writeInt(values.size());
        TIntDoubleIterator iterator = values.iterator();
        while (iterator.hasNext()) {
            iterator.advance();
            d.writeInt(iterator.key());
            d.writeDouble(iterator.value());
        }
    }

    @Override
    public void readFields(DataInput di) throws IOException {
        id = di.readInt();

        // Clear previous contents: Hadoop reuses Writable instances when
        // iterating over values, so stale entries must not accumulate.
        values.clear();
        int count = di.readInt();
        for (int i = 0; i < count; i++) {
            values.put(di.readInt(), di.readDouble());
        }
    }
}
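For context, a minimal driver sketch for a setup like the one described (SequenceFile input and output, custom Writables). Class names such as TriangleJob, TriangleMapper and StationReducer are placeholders, not taken from the question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class TriangleJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "triangle-stations");
        job.setJarByClass(TriangleJob.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setMapperClass(TriangleMapper.class);      // the map() shown above (name assumed)
        job.setReducerClass(StationReducer.class);     // reducer not shown in the question
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(StationWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(StationWritable.class);
        SequenceFileInputFormat.addInputPath(job, new Path(args[0]));
        SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}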

You won't get any benefit from Hadoop with only 2 MB of data. Hadoop is all about big data. Distributing the 2 MB to your 10 nodes costs more time than just doing the job on a single node. The real benefit starts with a large number of nodes and huge amounts of data.

If the processing is really that complex, you should be able to realize a benefit from using Hadoop.

The common issue with small files is that Hadoop will run a single Java process per file, which creates overhead from having to start many processes and slows things down. In your case this does not sound like it applies. More likely you have the opposite problem: only one mapper is trying to process your input, and at that point it doesn't matter how big your cluster is. Using the input split sounds like the right approach, but because your use case is specialized and deviates significantly from the norm, you may need to tweak a number of components to get the best performance.

So you should be able to get the benefits you are seeking from Hadoop MapReduce, but it will probably take significant tuning and custom input handling.
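The answer does not include code; as one illustration of what "custom input handling" could look like, a SequenceFileInputFormat subclass might lower the split ceiling itself, so that even a ~2 MB file is cut into several splits and therefore several map tasks. The class name and the 256 kB value are arbitrary choices for this sketch:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class SmallSplitSequenceFileInputFormat<K, V> extends SequenceFileInputFormat<K, V> {

    private static final long MAX_SPLIT_BYTES = 256L * 1024;  // example value

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        // Force a small split ceiling before delegating to the standard split logic.
        context.getConfiguration().setLong(
                "mapreduce.input.fileinputformat.split.maxsize", MAX_SPLIT_BYTES);
        return super.getSplits(context);
    }
}

It would be registered in the driver with job.setInputFormatClass(SmallSplitSequenceFileInputFormat.class).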

That said, seldom (if ever) will MapReduce be faster than a purpose-built solution. It is a generic tool whose value is that it can be used to distribute and solve many diverse problems without having to write a purpose-built solution for each.

So in the end I figured out a way to avoid storing intermediate values in Writables and keep them only in memory. This way it is faster. But still, a non-Hadoop solution is the best fit for this use case.
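This final answer does not show code; one common pattern that matches the description is in-mapper aggregation, where per-station results are merged into an in-memory map during map() and serialized only once, in cleanup(). The sketch below assumes a hypothetical StationWritable.merge() that combines two per-station maps (for example by summing values for duplicate keys); the question does not specify how such a merge should work:

import java.io.IOException;
import gnu.trove.map.hash.TIntObjectHashMap;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;

public class InMemoryTriangleMapper
        extends Mapper<IntWritable, TriangleWritable, IntWritable, StationWritable> {

    // One accumulated StationWritable per station id, kept in memory for the whole split.
    private final TIntObjectHashMap<StationWritable> aggregated = new TIntObjectHashMap<>();

    @Override
    protected void map(IntWritable triangleIndex, TriangleWritable triangle, Context context) {
        // 'kernel' is the same computation object used in the question's mapper (not shown).
        StationWritable[] stations = kernel.newton(triangle.getPoints());
        if (stations == null) {
            return;
        }
        for (StationWritable station : stations) {
            StationWritable existing = aggregated.get(station.getId());
            if (existing == null) {
                aggregated.put(station.getId(), station);
            } else {
                existing.merge(station);  // hypothetical merge of two per-station maps
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Each station is serialized and emitted once per mapper, not once per triangle.
        for (StationWritable station : aggregated.valueCollection()) {
            context.write(new IntWritable(station.getId()), station);
        }
    }
}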
