Complex String Parsing
I am working with a Chinese dictionary database stored as text, with entries saved in this format:
Traditional Simplified [pin1 yin1] /English equivalent 1/equivalent 2/
I've tried parsing it using delimiters (in Java).
This is what I have so far:
String delims = "[\\[\\]/]+";
String tokens[] = str.split(delims);
The problem is that the English equivalent can also contain the delimiter characters.
For instance:
⿔ ⿔ [gui1] /variant of 龜|龟[gui1]/
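To make the problem concrete, here is a minimal sketch that runs the split from above on this entry. The class and method names (`SplitProblemDemo`, `naiveSplit`) are made up for illustration only.

```java
import java.util.Arrays;

class SplitProblemDemo {
    // Splits on runs of '[', ']' and '/', exactly as in the attempt above.
    static String[] naiveSplit(String entry) {
        return entry.split("[\\[\\]/]+");
    }

    public static void main(String[] args) {
        String entry = "⿔ ⿔ [gui1] /variant of 龜|龟[gui1]/";
        // The bracketed pinyin inside the definition is treated as a delimiter too,
        // so the definition is cut in two and "gui1" appears as a token twice.
        System.out.println(Arrays.toString(naiveSplit(entry)));
    }
}
```

The two headwords also end up in a single token, because the space between them is not one of the delimiters.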
How would someone parse this String?
I'm trying to get the following information from the String:
Simplified: ⿔
Traditional: ⿔
Pinyin: gui1
English Equivalent: variant of 龜|龟[gui1]
Try using a regex to clean up the whole string.
String text = "⿔ ⿔ [gui1] /variant of 龜|龟[gui1]/";
String pattern = "(\\S+)\\s*(\\S+)\\s*\\[(.+?)\\]\\s*/(.+?)/";
text = text.replaceAll(pattern, "$1;$2;$3;$4");
(\\S+) ---> ⿔
Finds a continuous group of non-whitespace characters.

\\s* ---> (whitespace)
Finds continuous whitespace.

\\[(.+?)\\] ---> gui1
Finds everything inside [ bla bla bla ]. The '?' makes the match lazy, so it takes the shortest possible match, e.g. [ bla bla ] rather than [ bla bla] [ble ble ].

/(.+?)/ ---> variant of 龜|龟[gui1]
Same as above, but finds everything inside / bla bla /. The '?' again takes the shortest possible match.
Now text becomes:
⿔;⿔;gui1;variant of 龜|龟[gui1]
Next you can use ; as the delimiter to split them:
String tokens[] = text.split(";");
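As a variation on the same idea, the capture groups can be read directly with a `Matcher` instead of rewriting the string and splitting it again. This is only a sketch: the class name `CedictLineParser` and its `parse` method are hypothetical, and the pattern is the one shown above. Reading the groups directly sidesteps one possible pitfall of the two-step approach, namely a definition that itself contains a `;`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CedictLineParser {
    // Same pattern as above, compiled once so it can be reused for every line.
    private static final Pattern ENTRY =
            Pattern.compile("(\\S+)\\s*(\\S+)\\s*\\[(.+?)\\]\\s*/(.+?)/");

    // Returns {traditional, simplified, pinyin, english}, or null if nothing matches.
    static String[] parse(String line) {
        Matcher m = ENTRY.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String[] fields = parse("⿔ ⿔ [gui1] /variant of 龜|龟[gui1]/");
        System.out.println("Traditional: " + fields[0]);
        System.out.println("Simplified: " + fields[1]);
        System.out.println("Pinyin: " + fields[2]);
        System.out.println("English Equivalent: " + fields[3]);
    }
}
```

Because the last group is lazy, it stops at the first closing slash, so an entry with several definitions yields only the first one here.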
The regex pattern needs to be just a tad more complex, as CEDICT entries often contain several definitions:
矮小 矮小 [ai3 xiao3] /short and small/low and small/undersized/
So the regex would be:
^(\S+)\s+(\S+)\s+\[[^]]+\]\s+(/[^/\r]*){1,19}/$
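One way to apply this in Java is sketched below. It is an adaptation rather than the exact regex above: it also captures the pinyin, and it grabs everything between the first and last slash in one group, then splits that group on `/` to get the individual definitions. The `CedictEntry` class and its field names are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CedictEntry {
    final String traditional;
    final String simplified;
    final String pinyin;
    final List<String> definitions;

    // Two headwords, [pinyin], then everything between the first and last '/'.
    private static final Pattern LINE =
            Pattern.compile("^(\\S+)\\s+(\\S+)\\s+\\[([^\\]]+)\\]\\s+/(.+)/$");

    private CedictEntry(String traditional, String simplified,
                        String pinyin, List<String> definitions) {
        this.traditional = traditional;
        this.simplified = simplified;
        this.pinyin = pinyin;
        this.definitions = definitions;
    }

    // Returns null when the line does not look like a dictionary entry.
    static CedictEntry parse(String line) {
        // trim() drops a trailing '\r' or stray spaces before matching the whole line.
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        List<String> defs = Arrays.asList(m.group(4).split("/"));
        return new CedictEntry(m.group(1), m.group(2), m.group(3), defs);
    }

    public static void main(String[] args) {
        CedictEntry e = parse("矮小 矮小 [ai3 xiao3] /short and small/low and small/undersized/");
        System.out.println(e.traditional + " " + e.simplified + " [" + e.pinyin + "]");
        for (String definition : e.definitions) {
            System.out.println("  " + definition);
        }
    }
}
```

Since only `/` separates definitions, bracketed pinyin inside a definition, as in `variant of 龜|龟[gui1]`, stays intact.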