How to transliterate non-latin scripts?

I'm playing around with transliteration in PHP using iconv. In particular, I want to normalise accented characters and romanize other scripts from UTF-8 to plain ASCII.

While many characters work (such as Ž -> Z), others give odd results or raise errors.

For example, E ACUTE é (U+00E9) transliterates to ASCII with a single quote (U+0027) preceding the e, as if it's trying to represent the diacritic mark I'm trying to get rid of.

$utf_8 = "\xC3\xA9"; // <- é
$ascii = iconv( 'UTF-8', 'ASCII//TRANSLIT', $utf_8 );
// returns "'e", not "e"

Non-latin scripts are worse. For example, Greek capital sigma Σ (U+03A3), which should transliterate to latin S, is not recognised at all and raises an error:

$utf_8 = "\xCE\xA3"; // <- Σ
$ascii = iconv( 'UTF-8', 'ASCII//TRANSLIT', $utf_8 );
// Raises notice: iconv(): Detected an illegal character in input string

I can just about cope with the first one, but how can I transliterate "Σ" to "S", and do this reliably across other scripts that have equivalent characters?

I don't mind generating my own tables if there is a good source that works for most European languages.

Note that I've tried various collation tables, which are useful for normalising accented latin characters, but they don't work for transliterating between scripts.

I've not had much luck using iconv. It always manages to throw a bunch of notices.

The best luck I've had is with using a custom transliteration table. It's far from perfect but at least you'll feel like you have some solid ground.
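As a rough illustration of the custom-table approach, here is a minimal sketch built on PHP's strtr(). The function name transliterate_with_table and the handful of table entries are made up for this example; a real table would need far more entries, and the chosen romanizations (for instance Ж to "Zh") are just one convention among several.

```php
<?php
// Tiny illustrative table: UTF-8 byte sequences mapped to ASCII replacements.
// These entries are assumptions for the example, not a complete mapping.
$table = [
    "\xC3\xA9" => 'e',   // é (U+00E9)
    "\xC5\xBD" => 'Z',   // Ž (U+017D)
    "\xC3\x9F" => 'ss',  // ß (U+00DF)
    "\xCE\xA3" => 'S',   // Σ (U+03A3)
    "\xCF\x83" => 's',   // σ (U+03C3)
    "\xD0\x96" => 'Zh',  // Ж (U+0416)
];

// Hypothetical helper: replace every table key found in $text.
function transliterate_with_table(string $text, array $table): string
{
    // With an array argument, strtr() tries the longest keys first and does
    // not rescan text it has already replaced. Characters that are not in
    // the table pass through unchanged.
    return strtr($text, $table);
}

$ascii = transliterate_with_table("\xCE\xA3\xC3\xA9", $table); // input is "Σé"
```

Because unmapped characters are left as they are, it is easy to spot gaps in the table by checking the output for any remaining non-ASCII bytes.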

I've not found a good single source for transliteration tables. My unfamiliarity with anything but the latin script isn't helping.

I've tried something similar to this. It's mostly based on Doctrine 1 code and it isn't perfect, but it seems to work with all the test data I've thrown at it.
