Extracting content from HTML tags using Java

Question

I extracted data from an HTML page and parsed out the anchor tags, which look like the ones shown in the output below. I tried different approaches, such as extracting substrings, to get only the title and href attributes, but it is not working. Can anyone help me? Here is my code and a small snippet of my output.

My code

    doc = Jsoup.connect("myurl").get();

    Elements link = doc.select("a[href]");
    String stringLink = null;
    for (int i = 0; i < link.size(); i++)
    {
        stringLink = link.toString();
        System.out.println(stringLink);
    }

Output

<a class="link" title="Waf Ad" href="https://www.facebook.com/waf.ad.54" 
data-jsid="anchor" target="_blank"><img class="_s0 _rw img" src="https:
//fbcdn-profile-a.akamaihd.net/hprofile-ak-ash1/t5/186729_100007938933785_
508764241_q.jpg" alt="Waf Ad" data-jsid="img" /></a>
<a class="link" title="Ana Ga" href="https://www.facebook.com/ata.ga.31392410" 
data-jsid="anchor" target="_blank"><img class="_s0 _rw img" src="https://
fbcdn-profile-a.akamaihd.net/hprofile-ak-ash1/t5/186901_100002334679352_
162381693_q.jpg" alt="Ana Ga" data-jsid="img" /></a>

Answer 1

You can use the attr() method of the Element class to extract the value of an attribute.

For example:

// Assumes link is a single Element, e.g. one item from doc.select("a[href]")
String href = link.attr("href");
String title = link.attr("title");

See this page for more: Extract attributes, text, and HTML from elements
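The loop in the question appears to call toString() on the whole Elements collection, which prints the full markup of every link on each iteration. As a minimal sketch of the attr() approach, the following parses an inline HTML string instead of fetching a URL. The class name LinkAttributes, the helper extract, and the shortened sample markup are illustrative assumptions, and the jsoup library must be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class LinkAttributes {
    // Hypothetical helper: returns "title -> href" for each anchor that has an href.
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            // attr() returns an empty string when the attribute is absent
            result.add(a.attr("title") + " -> " + a.attr("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened sample markup modeled on the output in the question
        String html =
              "<a class=\"link\" title=\"Waf Ad\" href=\"https://www.facebook.com/waf.ad.54\">"
            + "<img alt=\"Waf Ad\" /></a>"
            + "<a class=\"link\" title=\"Ana Ga\" href=\"https://www.facebook.com/ata.ga.31392410\">"
            + "<img alt=\"Ana Ga\" /></a>";
        for (String line : extract(html)) {
            System.out.println(line);
        }
    }
}
```

Each anchor is handled as a single Element inside the loop, so only the two attribute values are read rather than the element's full HTML.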

Answer 2

To get the page title, you can use

Document doc = Jsoup.connect("myurl").get();
String title = doc.title();

To get the individual links from the href attributes, you can use this:

Elements links = doc.select("a[href]");
for (Element ele : links) {
    // attr() already returns a String, so no toString() call is needed
    System.out.println(ele.attr("href"));
}

The attr() method returns the value of the attribute whose name is passed to it, for the given element.

Answer 3

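import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Solution {
    public static void main(String[] args) {
        Scanner scan = new Scanner(System.in);
        // The first input line holds the number of lines to process
        int testCases = Integer.parseInt(scan.nextLine());

        // Compiled once: group 1 is the tag, group 2 is its text content
        Pattern r = Pattern.compile("<(.+)>([^<]+)</\\1>");

        while (testCases-- > 0) {
            String line = scan.nextLine();

            boolean matchFound = false;
            Matcher m = r.matcher(line);

            // Print the content of every matching tag pair on this line
            while (m.find()) {
                System.out.println(m.group(2));
                matchFound = true;
            }
            if (!matchFound) {
                System.out.println("None");
            }
        }
    }
}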

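Regular expression explanation:

<(.+)> matches an opening tag and captures the text between the angle brackets as group 1. ([^<]+) captures, as group 2, one or more characters that are not '<', that is, text content with no nested tag. </\1> matches the closing tag, where \1 is a backreference to group 1, so the closing tag must repeat the captured opening tag text exactly. In Java source the backslash is doubled, which is why the code writes \\1.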
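The pattern from the code above can be tried on fixed strings without reading from standard input. This is only a sketch: the class name TagContent and the helper contents are illustrative names, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagContent {
    // Same pattern as in the answer: group 1 is the tag, group 2 is its text content
    private static final Pattern TAG = Pattern.compile("<(.+)>([^<]+)</\\1>");

    // Hypothetical helper: returns the text content of every matching tag pair
    static List<String> contents(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = TAG.matcher(line);
        while (m.find()) {
            found.add(m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(contents("<h1>Hello</h1><p>World</p>")); // prints [Hello, World]
        System.out.println(contents("<a>text</b>"));                // prints []
        System.out.println(contents("<a href=\"x\">text</a>"));     // prints []
    }
}
```

The last call shows a limitation worth keeping in mind: because the backreference requires the closing tag to repeat everything captured in the opening tag, a tag with attributes, such as the anchors in the question, does not match. For reading title and href values, the attr() approach from the other answers is likely the better fit.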

Answer 1: score 4, accepted, 2014-03-06 09:08:26

Answer 2: score 3, 2014-03-06 09:11:52

Answer 3: score 0, 2019-07-03 17:09:08
