
Strip non-printable characters using Hadoop Map-Reduce

I am trying to process an HDFS file that contains non-printable characters. I wish to strip these characters out using MapReduce.

I have tried using Pig TextLoader and MR TextInputFormat (in an MR program), and both split a record into multiple records at the position where a non-printable character occurs. Below is the sample data:

===Data==(2 records)=

 4614:2011-12-20-08.45.08.169176^2011-12-20-18.15.08.100008^597^0^57^ZUKA^Grase^^^Grase,Dr^^^N^N^N^Dr^KG^ONLY INFORMATION ENTERED^UNKNOWN^0     ^      ^^^611190362
�^0^^^^^^0^Grase,Dr^^^, ,^^^^^^597^^^<fnm>DR</fnm><lnm>GRASE</lnm>^^^^^^^^SINGLE^0^0
6063:2010-07-04-04.00.00.100001^2010-07-04-04.01.00.180144^017^0^095^WEETE    ^Wien^^^Wien,Colock^^^N^N^N^Colock^KG^ONLY INFORMATION ENTERED^UNKNOWN^0     ^      ^295111915^^������9905^0^^^^^^0^Wien,Colock^40001 KIN RD^300 CAMORE ST^500 BLACK AVE^Woesfield, HA, 43723.^John Ball^^^25719110^617^������9905^^<fnm>COLOCK</fnm><lnm>WIEN</lnm>^^^^^^^^SINGLE^0^0

[In the less pager, the column with special characters in the first record looks like:

611190362^M<EF><BF><BD> ]

In vi or less the first record appears on one line, but when it is read in MR or Pig the record gets split because of these characters.

I want to avoid the record being split across lines while reading data from HDFS, and then process these records to get rid of the special characters.
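One possible explanation, offered as an assumption about this data rather than a confirmed diagnosis: the `^M` that less shows is a carriage return (0x0D), and line-oriented readers commonly treat a lone carriage return as a line terminator. The sketch below uses plain Java's `BufferedReader.readLine`, which behaves this way, to show a record splitting at the carriage return and staying whole once the carriage return is removed before the line split. The class name and the shortened sample record are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class CarriageReturnSplitDemo {

    // Reads lines the way BufferedReader does: \n, \r and \r\n each end a line.
    static List<String> readLines(String data) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(data))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Removes carriage returns so that only \n separates records.
    static String dropCarriageReturns(String data) {
        return data.replace("\r", "");
    }

    public static void main(String[] args) {
        // Hypothetical, shortened stand-in for the first sample record.
        String record = "4614:2011-12-20^611190362\r\uFFFD^0^^SINGLE^0^0\n";
        System.out.println(readLines(record).size());
        System.out.println(readLines(dropCarriageReturns(record)).size());
    }
}
```

If this is the cause, the stripping has to happen before or during line splitting, not in a UDF that runs after it. Hadoop's `LineRecordReader` reads a `textinputformat.record.delimiter` property that can restrict the delimiter to `\n`; whether your Hadoop version and Pig's TextLoader honor it is something to verify.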

Here is what I have tried with a basic UDF (snippet below). The program does strip the characters >= 0x80, but it performs the stripping on the already-split record.

Any help / pointers would be appreciated!

        /*
     * 
     * Pig Code:
     * register '/basepath/udf/DIF.jar'
    rel1 = LOAD '/user/home/test' USING TextLoader() AS (s:chararray);
    rel2 = FOREACH rel1 GENERATE StripNonPrintable(s) as recordline;
    dump rel2;

     * 
     */

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class StripNonPrintable extends EvalFunc<String>
    {
         public String exec(Tuple input) throws IOException {
             if (input == null || input.size() == 0)
                 return null;
             try{

                    // The loader passes each line in as a single chararray field.
                    String s = (String) input.get(0);
                    //s = "2001-12-20-08.45.08.169176^2001-12-20-08.45.08.131408^597^0^57^ZUCKA^Grase^^^Grase,Dr^^^N^N^N^Dr^KG^ONLY INFORMATION ENTERED^UNKNOWN^0     ^      ^^^6785790362�^0^^^^^^0^Grase,Dr^^^, ,^^^^^^597^^^<fnm>DR</fnm><lnm>GRASE</lnm>^^^^^^^^SINGLE^0^0";

                    int length = s.length();
                    char[] oldChars = new char[length];
                    s.getChars(0, length, oldChars, 0);
                    int newLen = 0;
                    for (int j = 0; j < length; j++) {
                        char ch = oldChars[j];
                        if (ch < 0x80  ) {
                            oldChars[newLen] = ch;
                            newLen++;
                        }
                    }
                    s = new String(oldChars, 0, newLen);

                    //System.out.println("New String = \n " + s);

                    return s;
             }catch(Exception e){
                 return null ;
             }
         }

     }

The class java.lang.Character has a method getType which:

Returns a value indicating a character's general category

Use java.lang.Character (no import is needed, since java.lang is imported automatically) and replace:

if (ch < 0x80  )

with the following code:

int c = Character.getType(ch);
// Keep ch only if its general category is one of those listed below.
// (Chaining != with || would be true for every character and strip nothing.)
if(c == Character.CONTROL || //remove it if you want to get rid of control characters such as \r
            c == Character.CONNECTOR_PUNCTUATION || 
            c == Character.CURRENCY_SYMBOL || 
            c == Character.DASH_PUNCTUATION || 
            c == Character.DECIMAL_DIGIT_NUMBER || 
            c == Character.ENCLOSING_MARK || 
            c == Character.END_PUNCTUATION || 
            c == Character.FINAL_QUOTE_PUNCTUATION || 
            c == Character.INITIAL_QUOTE_PUNCTUATION || 
            c == Character.LETTER_NUMBER || 
            c == Character.LOWERCASE_LETTER || 
            c == Character.MATH_SYMBOL || 
            c == Character.MODIFIER_LETTER || 
            c == Character.MODIFIER_SYMBOL || 
            c == Character.OTHER_LETTER || 
            c == Character.OTHER_NUMBER || //remove it if you want to get rid of ½ 
            c == Character.OTHER_PUNCTUATION || 
            c == Character.OTHER_SYMBOL || 
            c == Character.START_PUNCTUATION || 
            c == Character.TITLECASE_LETTER || 
            c == Character.UPPERCASE_LETTER)
// Note: Character.SPACE_SEPARATOR is not listed, so spaces are dropped; add it to keep them.

Use a combination of these categories to remove the characters you don't need.
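As a self-contained illustration of the category-based approach, here is a minimal sketch that inverts the test: it names the categories to drop and keeps everything else. The class name, the method names, and the particular set of categories are assumptions made for this example and should be adjusted to your data. U+FFFD (the replacement character that appears as `<EF><BF><BD>` in less) is handled explicitly, because its general category is OTHER_SYMBOL and it would otherwise be kept.

```java
public class NonPrintableStripper {

    // True for characters this sketch treats as non-printable.
    // Tab, \r and \n are all in the CONTROL category, so they are stripped too.
    static boolean isStrippable(char ch) {
        if (ch == '\uFFFD') {
            return true; // Unicode replacement character
        }
        switch (Character.getType(ch)) {
            case Character.CONTROL:
            case Character.FORMAT:
            case Character.PRIVATE_USE:
            case Character.UNASSIGNED:
                return true;
            default:
                return false;
        }
    }

    // Returns a copy of s without the characters flagged by isStrippable.
    static String strip(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (!isStrippable(ch)) {
                out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Fragment modeled on the first sample record.
        System.out.println(strip("611190362\r\uFFFD^0^^Grase,Dr")); // prints 611190362^0^^Grase,Dr
    }
}
```

Inside the UDF, the body of `exec` could then reduce to `return strip((String) input.get(0));`. A denylist like this keeps spaces and characters such as ½, which an allowlist would drop unless their categories were listed.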
