How to extract data from the following using RegEx?

I have a data set in the following pattern:

1<a href="/contact/">Joe</a><br />joe.doe@somemail.com</div>
2<a href="/contact/">Tom</a><br />tom.cat@aol.com</div>
3<a href="/contact/">Jerry</a><br />jerry.mouse@yahoo.co.in</div>

And so on.

I need to extract only the name and the email address from each entry. How do I do it?


Update:

Based on your responses, I've changed my data format to:

1(name)Joe(email)joe.doe@somemail.com(end)
2(name)Tom(email)tom.cat@aol.com(end)
3(name)Jerry(email)jerry.mouse@yahoo.co.in(end)

How do I parse that?

Don't use regular expressions to parse HTML.

Use an HTML parser. There are a bunch listed on this page. Based on my experience using Tidy, I would suggest JTidy. From their page:

JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.

UPDATE

Based on the edit to your question, use split() to split the string with \\([a-z]+\\) as a delimiter. This should give you the separate components:

String[] components = str.split("\\([a-z]+\\)");

Or you could use the more generic expression \\(.*?\\).
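As a rough illustration of what that split returns, here is a minimal sketch run against one record from the question. The class name SplitDemo and the helper splitRecord are made up for this example. Note that whatever precedes the first marker (the leading number in the sample data) comes back as element 0, so the name and the email land at indices 1 and 2.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits one record on its "(tag)" markers, e.g. (name), (email), (end).
    // Text before the first marker is returned as element 0;
    // the trailing empty string after "(end)" is dropped by split().
    static String[] splitRecord(String line) {
        return line.split("\\([a-z]+\\)");
    }

    public static void main(String[] args) {
        String line = "1(name)Joe(email)joe.doe@somemail.com(end)";
        String[] components = splitRecord(line);
        // prints [1, Joe, joe.doe@somemail.com]
        System.out.println(Arrays.toString(components));
    }
}
```

If a record has nothing before (name), element 0 is an empty string instead, so the indices of the name and the email stay the same.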

Use this regex:

\(name\)(.*)\(email\)(.*)\(end\)

Now the first capturing group (backreference \1) contains the name, and the second (\2) contains the email address.

Apply the same regex repeatedly to get the next name and email address.
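In Java, one way to apply the regex repeatedly is a Matcher.find() loop. The sketch below assumes the whole data set is held in a single string with one record per line. The class name ContactExtractor and the method extract are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContactExtractor {
    // Same expression as above. By default '.' does not match line
    // terminators, so the greedy (.*) groups stay within one record line.
    private static final Pattern RECORD =
            Pattern.compile("\\(name\\)(.*)\\(email\\)(.*)\\(end\\)");

    // Returns one {name, email} pair per record found in the input.
    static List<String[]> extract(String data) {
        List<String[]> contacts = new ArrayList<>();
        Matcher m = RECORD.matcher(data);
        while (m.find()) {          // each call advances to the next record
            contacts.add(new String[] { m.group(1), m.group(2) });
        }
        return contacts;
    }

    public static void main(String[] args) {
        String data = "1(name)Joe(email)joe.doe@somemail.com(end)\n"
                    + "2(name)Tom(email)tom.cat@aol.com(end)\n"
                    + "3(name)Jerry(email)jerry.mouse@yahoo.co.in(end)";
        for (String[] c : extract(data)) {
            System.out.println(c[0] + " <" + c[1] + ">");
        }
    }
}
```

If several records could ever share one line, the greedy (.*) groups would run across them; switching to the reluctant form (.*?) would be the safer choice in that case.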

If you are guaranteed that this will be the standard pattern for all of your entries, you can simply use String.split() on each line, using the regular expression \(.*?\) as the split pattern. This matches a ( followed by the least possible number of other characters, followed by a ). So the code looks something like this:

// for each String line
String[] items = line.split("\\(.*?\\)");
// items[0] is whatever precedes "(name)": the leading number, or "" if there is none
name = items[1];
email = items[2];
