
Intelligent transliteration in PHP

I'm interested in writing a PHP script (language-agnostic suggestions are welcome) that would transliterate a sentence or word written phonetically in the Latin alphabet into the script of another language. Since the input is written phonetically (i.e. by ear), I'd have to deal with variant spellings of the same word.

Assume that no single standard exists for romanization (for instance, in Chinese, you have Simplified Wade, etc.).

Does anyone have any advice on where I could start?

EDIT: I'm doing this purely for educational purposes. My initial impression was that figuring out the connection between variant spellings (which could be found in a corpus of IM messages or Facebook posts written in the romanized form of the language) would require some sort of machine learning tool. I'd like to know whether I was on the right track, and I'd like some help figuring out what to look into next to get this working (for instance, which machine learning tool should I look into?).

Try the Transliteration PHP extension by Derick Rethans:

This extension allows you to transliterate text in non-latin characters (such as Chinese, Cyrillic, Greek etc) to latin characters. Besides the transliteration the extension also contains filters to upper- and lowercase latin, cyrillic and greek, and perform special forms of transliteration such as converting ligatures such as the Norwegian "æ" to "ae" and normalizing punctuation and spacing.

It seems he has already started on just what you are looking for (unless you want to go from English to a Latin-script language, but at least this deals with the scripts of other languages :) ).

With Japanese, at least, there is a fixed set of letter combinations.

So you could create a matching array like this:

// Variant spellings that all map to the same kana
$map = array(
  'oo' => 'おう',
  'oh' => 'おう',
  'ou' => 'おう',
);

You would then continue with the remaining combinations, making sure you don't match 'su' when it should be 'tsu'.

This would only be a starting point, of course.
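
As a minimal sketch of the lookup-table idea above, PHP's built-in strtr() can apply such an array directly. When given an array, it tries the longest keys first and does not rescan text it has already replaced, which is what keeps 'tsu' from being read as 't' plus 'su'. The table contents and the names $romajiToKana and transliterateRomaji() are assumptions made for illustration; a real table would need every kana and its spelling variants.

```php
<?php
// Illustrative, incomplete table (an assumption, not a full romaji scheme).
$romajiToKana = [
    'tsu' => 'つ',
    'su'  => 'す',
    'ka'  => 'か',
    'oo'  => 'おう',
    'oh'  => 'おう',
    'ou'  => 'おう',
];

// Hypothetical helper: lowercases the input, then replaces every known
// letter combination. Anything not in the table is left untouched.
function transliterateRomaji(string $text, array $table): string
{
    // With an array argument, strtr() prefers the longest matching key
    // at each position and never re-processes replaced output.
    return strtr(strtolower($text), $table);
}

echo transliterateRomaji('katsu', $romajiToKana), "\n"; // かつ
echo transliterateRomaji('Ou', $romajiToKana), "\n";    // おう
```

Unmatched letters pass through unchanged, which makes gaps in the table easy to spot while it is being built.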

Machine learning is probably most practical with Chinese, but here's a rough start for hiragana: https://gist.github.com/1154969


 