Complex String Parsing

I am working with a text-based Chinese database that stores entries in this format:

Traditional Simplified [pin1 yin1] /English equivalent 1/equivalent 2/

I've tried parsing it in Java by splitting on delimiters. This is what I have so far:

    String delims = "[\\[\\]/]+";
    String[] tokens = str.split(delims);

The problem is that the English equivalent can itself contain delimiter characters. For instance:

⿔ ⿔ [gui1] /variant of 龜|龟[gui1]/

How would someone parse this String? I'm trying to get the following information from it:

Simplified : ⿔

Traditional : ⿔

Pinyin : gui1

English Equivalent : variant of 龜|龟[gui1]

Try using a regex to clean up the whole string.

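    String text = "⿔ ⿔ [gui1] /variant of 龜|龟[gui1]/";

    // Four capture groups: two non-whitespace runs, the bracketed pinyin, the slash-delimited English text
    String pattern = "(\\S+)\\s*(\\S+)\\s*\\[(.+?)\\]\\s*/(.+?)/";

    // Rewrite the line as four fields separated by ';'
    text = text.replaceAll(pattern, "$1;$2;$3;$4");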

(\\S+) ---> ⿔
Matches a continuous run of non-whitespace characters.

\\s*
Matches a continuous run of whitespace (possibly empty).

\\[(.+?)\\] ---> gui1
Captures everything inside [ bla bla bla ]. The '?' makes the match lazy, so it takes the shortest possible match, e.g. [ bla bla ] rather than [ bla bla] [ble ble ].

/(.+?)/ ---> variant of 龜|龟[gui1]
Same as above, but captures everything inside / bla bla /. Again, the '?' makes it take the shortest match.

Now text becomes:
⿔;⿔;gui1;variant of 龜|龟[gui1]

Next, you can split it using ; as the delimiter:

    String[] tokens = text.split(";");
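As a possible alternative to rewriting the line and splitting it again, the capture groups can be read directly with Pattern and Matcher. This avoids extra tokens if a definition ever contains a ; character. The sketch below reuses the pattern from this answer unchanged; the class name CedictLineParser and the method parse are hypothetical names chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class CedictLineParser {

    // Same pattern as above: two non-whitespace runs, the bracketed pinyin,
    // then the first slash-delimited English definition.
    private static final Pattern ENTRY =
            Pattern.compile("(\\S+)\\s*(\\S+)\\s*\\[(.+?)\\]\\s*/(.+?)/");

    // Returns {first field, second field, pinyin, english}, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = ENTRY.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String[] fields = parse("⿔ ⿔ [gui1] /variant of 龜|龟[gui1]/");
        if (fields != null) {
            for (String field : fields) {
                System.out.println(field);
            }
        }
    }
}
```

Note that, like the replaceAll version, this only captures the first definition when a line has several.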

The regex pattern needs to be a tad more complex, as entries in CEDICT often have several definitions:

矮小 矮小 [ai3 xiao3] /short and small/low and small/undersized/

So the regex would be:

^(\S+)\s+(\S+)\s+\[[^]]+\]\s+(/[^/\r]*){1,19}/$
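One way to get every definition out in Java is sketched below. It is a variation on the regex above, not the same pattern: it adds a capture group for the pinyin and captures the whole definition block at once, then splits that block on /. It assumes that / never appears inside a definition. The class name CedictEntry and its field names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value class; names are illustrative only.
final class CedictEntry {

    // Groups: 1 = first field, 2 = second field, 3 = pinyin, 4 = all definitions.
    // The greedy (.*) runs up to the last slash on the line.
    private static final Pattern ENTRY =
            Pattern.compile("^(\\S+)\\s+(\\S+)\\s+\\[([^\\]]*)\\]\\s+/(.*)/\\s*$");

    final String traditional;
    final String simplified;
    final String pinyin;
    final List<String> definitions;

    private CedictEntry(String traditional, String simplified,
                        String pinyin, List<String> definitions) {
        this.traditional = traditional;
        this.simplified = simplified;
        this.pinyin = pinyin;
        this.definitions = definitions;
    }

    static CedictEntry parse(String line) {
        Matcher m = ENTRY.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a recognised entry: " + line);
        }
        // Assumption: '/' only ever separates definitions, so a plain split is enough.
        List<String> defs = Arrays.asList(m.group(4).split("/"));
        return new CedictEntry(m.group(1), m.group(2), m.group(3), defs);
    }

    public static void main(String[] args) {
        CedictEntry e = parse("矮小 矮小 [ai3 xiao3] /short and small/low and small/undersized/");
        System.out.println(e.pinyin);
        for (String def : e.definitions) {
            System.out.println(def);
        }
    }
}
```

Because the brackets inside a definition such as variant of 龜|龟[gui1] fall inside group 4, they are left untouched.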
