
What is the most effective way to find the distinct values of columns in Hadoop?

I have a file of size 1TB, and we need to find the distinct values for 4 columns in the file. For example, say we have columns A, B, C, D, E, F and so on. Among them we need to find all the distinct values in column A and write them to one file in HDFS. Similarly for B, C and D.

Note: We have to do this for only 4 columns, not for the remaining ones. There are a total of 300 columns in the file.

We need to write a MapReduce job for this. What would be an effective way to address this problem? Appreciate your help. Thanks.

Let the mapper output one record for each column whose unique values you need. So in your example the map will, for a single input record, output 4 records with the keys being A, B, C and D.

In the reducer you can then handle all the values.

Depending on the details of what you need, you may want to use a key that looks something like this: "A:value of column A"
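To make the composite-key idea concrete, here is a minimal sketch in plain Java that simulates the map and reduce logic without any Hadoop classes. The class name, the comma delimiter and the column indexes (A to D sitting at positions 0 to 3) are assumptions made for illustration, not something stated in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Stand-alone simulation of the map and reduce logic; no Hadoop classes are used.
class DistinctColumnsSketch {
    // Hypothetical layout: the four wanted columns and their positions in a record.
    static final String[] NAMES = {"A", "B", "C", "D"};
    static final int[] INDEXES = {0, 1, 2, 3};

    // Map step: one input record yields one composite key "<column>:<value>"
    // for each wanted column. All other columns are ignored.
    static List<String> map(String record, String delimiter) {
        String[] fields = record.split(Pattern.quote(delimiter), -1);
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < NAMES.length; i++) {
            if (INDEXES[i] < fields.length) {
                keys.add(NAMES[i] + ":" + fields[INDEXES[i]]);
            }
        }
        return keys;
    }

    // Shuffle + reduce step, simulated in memory: identical keys collapse into
    // one entry, and values are grouped per column (one output file per column
    // in the real job). A real reducer would not hold these sets in memory; it
    // would simply write each grouped key once.
    static Map<String, SortedSet<String>> reduce(List<String> keys) {
        Map<String, SortedSet<String>> distinct = new TreeMap<>();
        for (String key : keys) {
            int sep = key.indexOf(':');
            distinct.computeIfAbsent(key.substring(0, sep), k -> new TreeSet<>())
                    .add(key.substring(sep + 1));
        }
        return distinct;
    }

    // Runs the whole simulated pipeline over comma-separated records.
    static Map<String, SortedSet<String>> run(List<String> records) {
        List<String> keys = new ArrayList<>();
        for (String record : records) {
            keys.addAll(map(record, ","));
        }
        return reduce(keys);
    }
}
```

In an actual job the map step would call `context.write` with the composite key and an empty value, and the framework's shuffle would do the grouping. Something like `MultipleOutputs` could then route each column's values to its own file, though that wiring is not shown here.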

Basically you need to filter out duplicate records. This can be done in a few steps, across the mapper, combiner and reducer. You can also use a Java Set. The mapper can output key = 'Column A', value = 'complete record'. Store the keys in a Set; if the Set already contains a key, don't emit the record. The same can be done in the combiner. You may not even need a reducer, though without one, duplicates seen by different mappers will remain. You also need to work out how to keep the Set from causing an out-of-memory error, for example by clearing it once it reaches a specific size.
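The Set-based filtering with a size cap might look roughly like the sketch below. It is plain Java rather than a real Hadoop `Mapper`; the class name, the `maxSize` parameter and the clear-everything eviction policy are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Stand-alone sketch of in-mapper de-duplication with a bounded Set.
class BoundedDedupSketch {
    private final Set<String> seen = new HashSet<>();
    private final int maxSize; // hypothetical cap on remembered keys

    BoundedDedupSketch(int maxSize) {
        this.maxSize = maxSize;
    }

    // Returns true when the key should be emitted, false when it is a
    // duplicate that is still remembered.
    boolean shouldEmit(String key) {
        if (seen.contains(key)) {
            return false;
        }
        if (seen.size() >= maxSize) {
            // Forget everything to bound memory; keys seen earlier may be
            // emitted again after this point.
            seen.clear();
        }
        seen.add(key);
        return true;
    }

    // Convenience helper: filters a list of keys the way a mapper would
    // filter its input stream.
    static List<String> filter(List<String> keys, int maxSize) {
        BoundedDedupSketch dedup = new BoundedDedupSketch(maxSize);
        List<String> emitted = new ArrayList<>();
        for (String key : keys) {
            if (dedup.shouldEmit(key)) {
                emitted.add(key);
            }
        }
        return emitted;
    }
}
```

Because the Set is cleared when it fills up, and because each mapper only sees its own input split, this cuts down the data sent to the shuffle but does not guarantee uniqueness on its own. A final reduce step is still what makes the output truly distinct.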

