
Efficiently encoding all the special characters in a string into entities

I have a string like "abcd !@&$%^^&*()<>!/". A separate string lists the entity code for each character, for example "!=&4....^=9...", and only the characters listed there should be encoded. I want to convert every special character (everything except alphanumerics) into its entity with a regex, because looping over the characters one by one is too slow.

For example, the result should look like "abc &#4..;&#4..". In other words, convert all the special characters on the keyboard.

Is there an efficient regex I can write? I have tried this with loops, but it is too slow to look at each character one by one while maintaining the list of all special-character entities in another string.

There are libraries, but they do not convert all of the characters.

The code I wrote:

// Needs java.util.StringTokenizer and StringUtils from Apache Commons Lang

// String to be encoded
String sDecoded = "abcd !@#$%^&*();'m,";

// Entity list: each pair is "character÷entity", and pairs are separated by "×".
// The cross and divide symbols serve as delimiters because the user cannot type them on a keyboard.
String specialCharacters = "&÷&amp;×–÷&ndash;";

// Check the input
if (sDecoded == null || sDecoded.trim().length() == 0)
    return sDecoded;

// Use StringTokenizer, which is faster than the split method
StringTokenizer st;
String[] reg;
String[] charactersArray = sDecoded.split("");
String sEncoded = "";

// For each input character, walk the list of character÷entity pairs
for (int i = 0; i < charactersArray.length; i++) {
    st = new StringTokenizer(specialCharacters, "×");
    String c = charactersArray[i];

    while (st.hasMoreElements()) {
        reg = st.nextElement().toString().split("÷");

        // This is an error: the character to encode should never be blank
        if (StringUtils.isBlank(reg[0]))
            return sDecoded;

        if (c.equalsIgnoreCase(reg[0])) {
            sEncoded = sEncoded + c.replace(reg[0], reg[1]);
            break;
        }

        // No pair matched: keep the character unchanged
        if (st.countTokens() == 0)
            sEncoded = sEncoded + c;
    }
}

return sEncoded;

I don't know what definition of "efficient" you are using, but there is the "don't reinvent the wheel" efficiency of a simple call to the Apache Commons StringEscapeUtils utility class:

String encoded = StringEscapeUtils.escapeXml11(str);

or

String encoded = StringEscapeUtils.escapeHtml4(str);

and a variety of other similar methods, depending on exactly which encoding you want.
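If the goal is the numeric form shown in the question (every non-alphanumeric character written as `&#NN;`), a single pass with a precompiled pattern is one way to get it using only the JDK. The following is a minimal sketch, not a tuned implementation: the class name `NumericEntityEncoder` is hypothetical, leaving spaces unencoded is an assumption based on the sample output in the question, and `Matcher.replaceAll` with a function argument assumes Java 9 or later. Whether it is faster than a plain loop would need to be measured.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericEntityEncoder {
    // Any single character that is not an ASCII letter, digit, or space.
    // Compiled once so repeated calls do not pay the compilation cost.
    private static final Pattern SPECIAL = Pattern.compile("[^A-Za-z0-9 ]");

    public static String encode(String input) {
        if (input == null || input.isEmpty()) {
            return input;
        }
        Matcher m = SPECIAL.matcher(input);
        // Replace each match with its decimal numeric character reference.
        // codePointAt keeps characters outside the BMP as one entity.
        return m.replaceAll(r -> "&#" + r.group().codePointAt(0) + ";");
    }

    public static void main(String[] args) {
        System.out.println(encode("abcd !@&")); // abcd &#33;&#64;&#38;
    }
}
```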

Your approach is quite slow and inefficient. It may look elegant nowadays to use regex as a silver bullet for everything, but it is definitely not the right tool for this task. I see you are also using a tokenizer, which is slow as well, and a loop inside a loop degrades performance further.

I would recommend an iterative approach with a StringBuilder, which produces very fast results; try it for yourself. Write an 'if' statement for each special character. Even if it looks like a lot of code, it will be very fast. Try this:

class Scratch {

    public static void main(String[] args) {
        System.out.println(escapeSpecials("abc &"));
    }

    public static String escapeSpecials(String origin) {
        StringBuilder result = new StringBuilder();
        char[] chars = origin.toCharArray();
        for (char c : chars) {
            if (c == '&') {
                result.append("&amp;");
            } else if (c == '\u2013') {
                result.append("&ndash;");
            } else {
                // not a special character
                result.append(c);
            }
        }
        return result.toString();
    }
}
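If the entity list has to stay in the delimited string format from the question, the same single-pass StringBuilder idea can be driven by a lookup table that is parsed once, instead of one 'if' per character. This is a sketch under assumptions: the class and method names (`EntityTableEncoder`, `parseTable`, `encode`) are hypothetical, the list format is taken to be "character÷entity" pairs separated by "×" as in the question, and the delimiters are written as Unicode escapes only to avoid source-encoding issues.

```java
import java.util.HashMap;
import java.util.Map;

public class EntityTableEncoder {
    // Parse "char÷entity×char÷entity..." once into a lookup table.
    // \u00D7 is the cross sign, \u00F7 is the divide sign.
    static Map<Character, String> parseTable(String spec) {
        Map<Character, String> table = new HashMap<>();
        for (String pair : spec.split("\u00D7")) {
            String[] parts = pair.split("\u00F7", 2);
            // Skip malformed pairs rather than failing the whole parse.
            if (parts.length == 2 && parts[0].length() == 1) {
                table.put(parts[0].charAt(0), parts[1]);
            }
        }
        return table;
    }

    // One pass over the input with a constant-time lookup per character.
    static String encode(String input, Map<Character, String> table) {
        if (input == null || input.isEmpty()) {
            return input;
        }
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String entity = table.get(c);
            if (entity != null) {
                out.append(entity);
            } else {
                out.append(c); // not in the list: keep unchanged
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> table =
                parseTable("&\u00F7&amp;\u00D7\u2013\u00F7&ndash;");
        System.out.println(encode("abc & \u2013", table)); // abc &amp; &ndash;
    }
}
```

Building the table once and reusing it for every call avoids the re-tokenizing that the nested loop in the question performs for each input character.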
