
How to partition a file into smaller pieces for performing KNN in Hadoop MapReduce

In a KNN-like algorithm we need to load the model data into a cache in order to predict records.

Here is an example for KNN.

[image: KNN example]

So if the model is a large file, say 1 or 2 GB, will we still be able to load it into the distributed cache? Example:

[image: example]

In order to predict one outcome, we need to find the distance between that single record and every record in the model result, and take the minimum distance. So we need to have the model result at hand. And if it is a large file, it cannot be loaded into the distributed cache for the distance computation.
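For reference, the straightforward distributed-cache version described above would look roughly like the sketch below. It is only an illustration, not working production code: it assumes a comma-separated model with the class label in the last column, shipped as model.csv via job.addCacheFile(new URI("/models/model.csv#model.csv")), and it predicts with the single closest row (1-NN).

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class KnnMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final List<double[]> modelFeatures = new ArrayList<>();
        private final List<String> modelLabels = new ArrayList<>();

        @Override
        protected void setup(Context context) throws IOException {
            // The cached model is symlinked into the task's working directory
            // under the name given after '#' in addCacheFile.
            try (BufferedReader br = new BufferedReader(new FileReader("model.csv"))) {
                String line;
                while ((line = br.readLine()) != null) {
                    String[] parts = line.split(",");
                    double[] features = new double[parts.length - 1];
                    for (int i = 0; i < features.length; i++) {
                        features[i] = Double.parseDouble(parts[i]);
                    }
                    modelFeatures.add(features);
                    modelLabels.add(parts[parts.length - 1]);   // last column = class label
                }
            }
        }

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            // The record to classify is assumed to have the same feature
            // columns as the model, but no label.
            String[] parts = record.toString().split(",");
            double[] query = new double[parts.length];
            for (int i = 0; i < parts.length; i++) {
                query[i] = Double.parseDouble(parts[i]);
            }

            // 1-NN: keep the label of the closest model row.
            double best = Double.MAX_VALUE;
            String bestLabel = null;
            for (int r = 0; r < modelFeatures.size(); r++) {
                double[] row = modelFeatures.get(r);
                double d = 0;
                for (int i = 0; i < query.length; i++) {
                    double diff = query[i] - row[i];
                    d += diff * diff;
                }
                if (d < best) {
                    best = d;
                    bestLabel = modelLabels.get(r);
                }
            }
            context.write(record, new Text(bestLabel));
        }
    }

This only works while the whole model fits into the memory of one task, which is exactly the problem.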

One way is to split/partition the model result into several files, perform the distance calculation against all the records in each file, then find the minimum distance and the most frequent class label, and predict the outcome.

How can we partition the file and perform the operation on these partitions?

i.e.  1st record <Distance> file1, file2, ..., fileN
      2nd record <Distance> file1, file2, ..., fileN

This is what came to my mind.
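In MapReduce terms, one way I can imagine realizing it is roughly the sketch below. It lets Hadoop's input splits act as the partitions of the large model file, broadcasts the much smaller set of records to classify via the distributed cache instead, and lets the reducer pick the global minimum per record. The file names, the comma-separated layout and the 1-NN simplification are only placeholders.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PartitionedKnn {

        // Mapper: its input split is one partition of the large model file.
        // For every model row it emits, per query record, the distance and
        // the row's label, keyed by the query record's id.
        public static class PartitionMapper extends Mapper<LongWritable, Text, Text, Text> {

            private final List<double[]> queries = new ArrayList<>();

            @Override
            protected void setup(Context context) throws IOException {
                // The small query set is broadcast via the distributed cache
                // (e.g. job.addCacheFile(new URI("/input/queries.csv#queries.csv"))).
                try (BufferedReader br = new BufferedReader(new FileReader("queries.csv"))) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        String[] parts = line.split(",");
                        double[] q = new double[parts.length];
                        for (int i = 0; i < parts.length; i++) {
                            q[i] = Double.parseDouble(parts[i]);
                        }
                        queries.add(q);
                    }
                }
            }

            @Override
            protected void map(LongWritable offset, Text modelRow, Context context)
                    throws IOException, InterruptedException {
                String[] parts = modelRow.toString().split(",");
                String label = parts[parts.length - 1];
                double[] row = new double[parts.length - 1];
                for (int i = 0; i < row.length; i++) {
                    row[i] = Double.parseDouble(parts[i]);
                }
                for (int q = 0; q < queries.size(); q++) {
                    double d = 0;
                    for (int i = 0; i < row.length; i++) {
                        double diff = queries.get(q)[i] - row[i];
                        d += diff * diff;
                    }
                    context.write(new Text("q" + q), new Text(d + "," + label));
                }
            }
        }

        // Reducer: for one query record it sees candidates from every model
        // partition and keeps the label of the overall closest row (1-NN).
        public static class MinDistanceReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text queryId, Iterable<Text> candidates, Context context)
                    throws IOException, InterruptedException {
                double best = Double.MAX_VALUE;
                String bestLabel = null;
                for (Text candidate : candidates) {
                    String[] parts = candidate.toString().split(",");
                    double d = Double.parseDouble(parts[0]);
                    if (d < best) {
                        best = d;
                        bestLabel = parts[1];
                    }
                }
                context.write(queryId, new Text(bestLabel));
            }
        }
    }

Emitting one pair per query record and model row is heavy on the shuffle; keeping only the per-partition minimum in the mapper (or using the reducer class as a combiner) and, for k > 1, a bounded priority queue per query record would reduce that.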

Is there any better way?

Any pointers would help me.

I think the way you partition the data mainly depends on the data itself.

Given that you have a model with a bunch of rows and you want to find the k closest ones to each input record, the trivial solution is to compare them one by one. This can be slow because it means going through 1-2 GB of data millions of times (I assume you have a large number of records that you want to classify, otherwise you don't need Hadoop).

That is why you need to prune your model efficiently (that is your partitioning), so that you only compare the rows that are most likely to be the closest. This is a hard problem and requires knowledge of the data you operate on.

Additional tricks you can use to squeeze out more performance are:

  • Pre-sorting the input data so that the input items that will be compared against the same partition come together. Again, this depends on the data you operate on.
  • Using random-access indexed files (like Hadoop's MapFiles) to find the data faster and cache it; a rough sketch follows this list.
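For example, a random-access lookup into a sorted model stored as a Hadoop MapFile could look roughly like this. The path, the key scheme ("partition-key-42") and the value layout are made-up placeholders; the point is only that reader.get() seeks through the index instead of scanning the whole file.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    public class ModelLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Directory containing the MapFile's "data" and "index" parts.
            Path modelDir = new Path("/models/knn-model");

            try (MapFile.Reader reader = new MapFile.Reader(modelDir, conf)) {
                Text key = new Text("partition-key-42");   // e.g. a coarse partition/grid-cell id
                Text value = new Text();
                if (reader.get(key, value) != null) {
                    System.out.println("model rows for " + key + ": " + value);
                }
            }
        }
    }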

In the end it may actually be easier to store your model in a Lucene index, so you can achieve the effect of partitioning by looking up the index. Pre-sorting the data is still helpful there.
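If you go the Lucene route, the lookup side could look roughly like this (Lucene 8.x-style API). The field names "cell" and "row" are invented for the example: the idea is that every model row is indexed under a coarse partition key, so each input record only pulls the rows of its own partition and computes distances against those.

    import java.nio.file.Paths;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class LuceneModelLookup {
        public static void main(String[] args) throws Exception {
            try (DirectoryReader reader =
                     DirectoryReader.open(FSDirectory.open(Paths.get("/models/knn-index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // Fetch only the model rows whose partition key matches the query record's.
                TopDocs hits = searcher.search(new TermQuery(new Term("cell", "42")), 1000);
                for (ScoreDoc hit : hits.scoreDocs) {
                    Document doc = searcher.doc(hit.doc);
                    String row = doc.get("row");   // raw model row as stored at index time
                    // ... compute the distance between this row and the query record here
                }
            }
        }
    }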
