Forming a regular expression in Java to extract wiki links

I am trying to write a regex in Java to find and extract all wiki links (those starting with /wiki) that occur after the first occurrence of a paragraph tag in the HTML source of a web page. For example:

<a href="/wiki/Computer_scientist" title="Computer scientist">computer scientist</a> 
<p>Its fields can be divided into a variety of theoretical and <a href="/wiki/Practical_disciplines"

This should extract /wiki/Practical_disciplines.

I am not very familiar with regular expressions, but after doing some research, this is what I have come up with:

ArrayList<String> wikiLinks = new ArrayList<String>();
Pattern wikiPattern = Pattern.compile("^<p>([a-zA-Z0-9+&@/%?<>\"=~_|!,.;])+^(/wiki/[a-zA-Z0-9+&@/%?=~_|!,.;]+");
Matcher wikiMatcher = wikiPattern.matcher("srcString");
while (wikiMatcher.find()) {
    wikiLinks.add(srcString.substring(wikiMatcher.start(0),
        wikiMatcher.end(0)));
}

I know this is poorly formed and far from correct, but if somebody could help me formulate a regex for this or point me in the right direction, I would really appreciate it.

You could use this regex:

<p>.*?href=\"(.*?)\"

See the regex demo / explanation.

Java (demo):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegEx {
    public static void main(String[] args) {
        String s = "<a href=\"/wiki/Computer_scientist\" title=\"Computer scientist\">computer scientist</a> <p>Its fields can be divided into a variety of theoretical and <a href=\"/wiki/Practical_disciplines\"";
        String r = "<p>.*?href=\"(.*?)\"";
        Pattern p = Pattern.compile(r);
        Matcher m = p.matcher(s);
        while (m.find()) {
            System.out.println(m.group(1));
        }
    }
}
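
Note that the regex above captures only the first href after each <p>, and it does not check that the link starts with /wiki. If the goal is every /wiki link after the first paragraph tag, one possible approach is to locate the first <p> with indexOf and then scan only the rest of the string for href values. The following is a minimal sketch under some assumptions: the class name WikiLinkExtractor, the method extractWikiLinks, and the sample string are hypothetical, the paragraph tag is assumed to appear literally as <p> with no attributes, and href values are assumed to be double-quoted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WikiLinkExtractor {
    // Captures the path of any double-quoted href that starts with /wiki/.
    private static final Pattern WIKI_HREF =
            Pattern.compile("href=\"(/wiki/[^\"]*)\"");

    // Hypothetical helper: returns every /wiki/ link found after the first "<p>".
    static List<String> extractWikiLinks(String html) {
        List<String> links = new ArrayList<>();
        int start = html.indexOf("<p>"); // assumes a bare <p> tag without attributes
        if (start < 0) {
            return links; // no paragraph tag, so nothing to extract
        }
        Matcher m = WIKI_HREF.matcher(html);
        m.region(start, html.length()); // restrict matching to the text after the first <p>
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample input, loosely based on the snippet in the question.
        String s = "<a href=\"/wiki/Computer_scientist\">computer scientist</a> "
                + "<p>Its fields can be divided into "
                + "<a href=\"/wiki/Practical_disciplines\">practical disciplines</a> and "
                + "<a href=\"/wiki/Theory\">theory</a>";
        for (String link : extractWikiLinks(s)) {
            System.out.println(link);
        }
    }
}
```

The link before the <p> is skipped because the matcher's region starts at the paragraph tag. For real-world pages, an HTML parser is generally more robust than a regex, since attributes on the <p> tag, single-quoted hrefs, or unusual whitespace would defeat this sketch.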
