ICU: Transliterate and then remove all non-alphanumeric characters

Can this be done with ICU alone, without falling back to a regex?

Currently I normalize filenames like this:

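protected function normalizeFilename($filename)
{
    // The string below is a compound transliterator ID, not a rule set,
    // so it is passed to create() rather than createFromRules().
    $transliterator = Transliterator::create(
        'Any-Latin; Latin-ASCII; [:Punctuation:] Remove;'
    );
    $filename = $transliterator->transliterate($filename);
    // Strip whatever is still outside the allowed character set.
    $filename = preg_replace('/[^A-Za-z0-9_]/', '', $filename);
    return $filename;
}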

Can I get rid of the regular expression here and do everything with ICU calls?

Use the correct tool for the job

I don't see anything wrong with what you're doing now.

ICU transliteration is first and foremost language-oriented. It tries to preserve meaning.

Regular expressions, on the other hand, manipulate individual characters, which gives you the assurance that the file name is restricted to the characters you selected.

In this case, the combination is perfect.

I have, of course, looked for a solution to your question. But to be honest, I couldn't find anything that would work on all possible inputs.

For instance, not every character we would consider a punctuation mark is removed by [:Punctuation:] Remove;. Try the Russian name Корнильев, Кирилл. After applying your ID it becomes Kornilʹev Kirill. The ʹ is clearly not a punctuation mark, but you don't want it in your file name.
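To see which characters slip through, here is a minimal sketch that applies the ID from the question and lists every remaining character outside the question's whitelist. The function name leftoverCharacters is hypothetical, and the sketch assumes the PHP intl extension is loaded.

```php
<?php
// Hypothetical helper (not from the original post): returns the distinct
// characters that survive "[:Punctuation:] Remove" but are still outside
// the A-Za-z0-9_ whitelist used by the regex in the question.
function leftoverCharacters(string $input): array
{
    $transliterator = Transliterator::create(
        'Any-Latin; Latin-ASCII; [:Punctuation:] Remove'
    );
    if ($transliterator === null) {
        throw new RuntimeException(intl_get_error_message());
    }

    $output = $transliterator->transliterate($input);
    if ($output === false) {
        throw new RuntimeException($transliterator->getErrorMessage());
    }

    // The /u modifier makes the regex match whole UTF-8 characters.
    preg_match_all('/[^A-Za-z0-9_]/u', $output, $matches);

    return array_values(array_unique($matches[0]));
}

// Print the code point of each leftover character.
foreach (leftoverCharacters('Корнильев, Кирилл') as $char) {
    printf("U+%04X\n", IntlChar::ord($char));
}
```

The exact leftovers can vary with the ICU version, but the space between the two names is never punctuation, so at least that one remains for the regex to handle.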

So I would advise using the correct tool for the job:

  1. Use ICU to get the best ASCII equivalent. Using only Latin-ASCII; as the ID will do. Nice and simple.
  2. Then use a regular expression, just like you did, to make sure you're left with only the characters you need.
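The two steps above can be sketched as follows. This is only an illustration under a few assumptions: the intl extension is available, the function is written as a standalone function rather than a class method, and Any-Latin; is kept from the question's ID so that non-Latin scripts such as Cyrillic are romanized before Latin-ASCII; runs.

```php
<?php
// Sketch of the two-step approach: ICU for the ASCII equivalent,
// then a regex whitelist for the final guarantee.
function normalizeFilename(string $filename): string
{
    // Step 1: let ICU produce the best ASCII approximation.
    $transliterator = Transliterator::create('Any-Latin; Latin-ASCII');
    if ($transliterator === null) {
        throw new RuntimeException(intl_get_error_message());
    }

    $ascii = $transliterator->transliterate($filename);
    if ($ascii === false) {
        throw new RuntimeException($transliterator->getErrorMessage());
    }

    // Step 2: drop everything that is not in the allowed set.
    return preg_replace('/[^A-Za-z0-9_]/', '', $ascii) ?? '';
}

echo normalizeFilename('Корнильев, Кирилл'), "\n";
echo normalizeFilename('Crème brûlée.txt'), "\n";
```

Note that the whitelist from the question also strips the dot, so a file extension would have to be split off first if it needs to survive.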

There is really nothing wrong with this.

PS: Personally, I don't think the person, or persons, who wrote the ICU user guide should be complimented on a job well done. What a mess.
