Complex string parsing

Question

I am working with a Chinese dictionary stored as a text file, where each entry has this format:

Traditional Simplified [pin1 yin1] /English equivalent 1/equivalent 2/

I've tried parsing it using delimiters (in Java).

This is what I have so far:

    String delims = "[\\[\\]/]+";
    String[] tokens = str.split(delims);

The problem is that the English equivalent can itself contain delimiter characters.

For instance:

⿔ ⿔ [gui1] /variant of 龜|龟[gui1]/

How would one parse this string?

I'm trying to get the following information from the string:

Simplified: ⿔

Traditional: ⿔

Pinyin: gui1

English equivalent: variant of 龜|龟[gui1]

Answer 1

Try using a regex to clean up the whole string.

String text = "⿔ ⿔ [gui1] /variant of 龜|龟[gui1]/";

String pattern = "(\\S+)\\s*(\\S+)\\s*\\[(.+?)\\]\\s*/(.+?)/";

text = text.replaceAll(pattern, "$1;$2;$3;$4");

(\\S+) ---> ⿔
matches a run of non-whitespace characters

\\s*
matches any whitespace between fields (zero or more characters)

\\[(.+?)\\] ---> gui1
captures everything inside [ bla bla bla ].
'?' makes the match lazy, so it takes the shortest possible match,
e.g. [ bla bla ] rather than [ bla bla] [ble ble ]

/(.+?)/ ---> variant of 龜|龟[gui1]
same as above, but captures everything inside / bla bla /;
'?' again takes the shortest match
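
To make the lazy-versus-greedy point concrete, here is a small sketch that runs both forms of the bracket pattern on the sample input from the explanation. The class and method names (LazyVsGreedyDemo, firstGroup) are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsGreedyDemo {
    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "[ bla bla] [ble ble ]";
        // Lazy (.+?): stops at the first closing bracket.
        System.out.println("lazy:   '" + firstGroup("\\[(.+?)\\]", input) + "'");
        // Greedy (.+): runs on to the last closing bracket.
        System.out.println("greedy: '" + firstGroup("\\[(.+)\\]", input) + "'");
    }
}
```

The lazy form captures only the contents of the first bracket pair, while the greedy form swallows everything up to the final ].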

You can test the regex here

Now text becomes:
⿔;⿔;gui1;variant of 龜|龟[gui1]

Next, you can use ; as the delimiter to split the fields:

String tokens[] = text.split(";");
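
As a possible variation, the capture groups can be read directly from a Matcher instead of going through replaceAll and a second split. This avoids trouble if a definition ever contains a ; character. The sketch below reuses the same pattern; the class name EntryFields and the method parse are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntryFields {
    // Same pattern as above, compiled once and reused.
    private static final Pattern ENTRY =
        Pattern.compile("(\\S+)\\s*(\\S+)\\s*\\[(.+?)\\]\\s*/(.+?)/");

    // Returns {traditional, simplified, pinyin, english},
    // or null if the line does not look like an entry.
    static String[] parse(String line) {
        Matcher m = ENTRY.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String[] fields = parse("⿔ ⿔ [gui1] /variant of 龜|龟[gui1]/");
        for (String field : fields) {
            System.out.println(field);
        }
    }
}
```

Note that /(.+?)/ stops at the first closing slash, so for an entry with several definitions this only yields the first one.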

Answer 2

The regex pattern needs to be just a tad more complex, as a CEDICT entry often has several definitions:

矮小 矮小 [ai3 xiao3] /short and small/low and small/undersized/

So the regex would be:

^(\S+)\s+(\S+)\s+\[[^]]+\]\s+(/[^/\r]*){1,19}/$
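
One caveat when using a pattern like this in Java: a repeated capturing group such as (...){1,19} only retains its last repetition, and the pattern above does not capture the pinyin at all. A minimal sketch of one way around both points, assuming definitions never contain a / themselves, is to capture the whole definitions block and split it afterwards. The class CedictLine and its fields are hypothetical names for this example, and the pattern is an adapted variant, not the one from the answer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CedictLine {
    // Groups: 1 = traditional, 2 = simplified, 3 = pinyin,
    // 4 = everything between the first and the last slash.
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+)\\s+(\\S+)\\s+\\[([^\\]]+)\\]\\s+/(.*)/\\s*$");

    final String traditional;
    final String simplified;
    final String pinyin;
    final List<String> definitions;

    private CedictLine(String traditional, String simplified,
                       String pinyin, List<String> definitions) {
        this.traditional = traditional;
        this.simplified = simplified;
        this.pinyin = pinyin;
        this.definitions = definitions;
    }

    // Returns the parsed entry, or null if the line does not match the format.
    static CedictLine parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // Assumption: a slash only ever separates definitions.
        List<String> defs = Arrays.asList(m.group(4).split("/"));
        return new CedictLine(m.group(1), m.group(2), m.group(3), defs);
    }

    public static void main(String[] args) {
        CedictLine entry = parse(
            "矮小 矮小 [ai3 xiao3] /short and small/low and small/undersized/");
        System.out.println(entry.pinyin);
        System.out.println(entry.definitions);
    }
}
```

Because the pinyin group stops at the first ], a bracket inside a definition (as in 龜|龟[gui1]) is left untouched and stays part of the definition text.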

2 answers

Answer 1: score 2, accepted, 2012-01-17 06:35:50

Answer 2: score 0, 2012-05-28 17:33:48
