
Extracting content from html tags using Java

I extracted data from an HTML page and parsed out anchor tags like the ones shown below. I have tried different approaches, such as taking substrings, to extract only the title and href attributes, but it's not working. Can anyone help me? My code and a small snippet of its output follow.

My code:

    doc = Jsoup.connect("myurl").get();

    Elements link = doc.select("a[href]");
    String stringLink = null;
    for (int i = 0; i < link.size(); i++) {
        stringLink = link.toString();
        System.out.println(stringLink);
    }

Output:

<a class="link" title="Waf Ad" href="https://www.facebook.com/waf.ad.54" 
data-jsid="anchor" target="_blank"><img class="_s0 _rw img" src="https:
//fbcdn-profile-a.akamaihd.net/hprofile-ak-ash1/t5/186729_100007938933785_
508764241_q.jpg" alt="Waf Ad" data-jsid="img" /></a>
<a class="link" title="Ana Ga" href="https://www.facebook.com/ata.ga.31392410" 
data-jsid="anchor" target="_blank"><img class="_s0 _rw img" src="https://
fbcdn-profile-a.akamaihd.net/hprofile-ak-ash1/t5/186901_100002334679352_
162381693_q.jpg" alt="Ana Ga" data-jsid="img" /></a>

You can use the attr() method of the Element class to extract the value of an attribute.

For example:

String href = link.attr("href");
String title = link.attr("title");

See this page for more: Extract attributes, text, and HTML from elements

To get the page title, you can use:

Document doc = Jsoup.connect("myurl").get();
String title = doc.title();

To get the individual links from the different hrefs, you can use this:

Elements links = doc.select("a[href]");
for (Element ele : links) {
    System.out.println(ele.attr("href"));
}

The attr() method returns the value of the attribute you name, taken from the matching tag.

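import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Solution {
    public static void main(String[] args) {
        Scanner scan = new Scanner(System.in);
        // First input line holds the number of lines to process.
        int testCases = Integer.parseInt(scan.nextLine());

        while (testCases-- > 0) {
            String line = scan.nextLine();

            boolean matchFound = false;
            // Group 1: tag name, group 2: tag content, \1: matching closing tag.
            Pattern r = Pattern.compile("<(.+)>([^<]+)</\\1>");
            Matcher m = r.matcher(line);

            // Print the content of every well-formed tag pair on the line.
            while (m.find()) {
                System.out.println(m.group(2));
                matchFound = true;
            }
            if (!matchFound) {
                System.out.println("None");
            }
        }
    }
}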


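Regular expression explanation:

<(.+)> matches an opening tag and captures everything between the angle brackets (the tag name) as group 1. ([^<]+) captures the tag's content as group 2: one or more characters, none of which is <. </\1> matches the closing tag, where the backreference \1 requires the same name that the opening tag had.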
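The regex above pulls out tag content, whereas the question asks for the title and href attributes. As a rough sketch only, the same java.util.regex approach could be pointed at attributes instead. The class name AnchorAttributes and both helper methods are hypothetical names made up for this illustration, and the code assumes attribute values are double-quoted and contain no > character. For real pages, Jsoup's attr() as shown in the other answers is likely the more robust choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Jsoup or the original answers.
public class AnchorAttributes {
    // Matches one opening <a ...> tag and captures its attribute text as group 1.
    private static final Pattern ANCHOR = Pattern.compile("<a\\s+([^>]*)>");

    // Returns the double-quoted value of the named attribute, or "" if it is absent.
    public static String attribute(String attributes, String name) {
        Pattern p = Pattern.compile("(?:^|\\s)" + Pattern.quote(name) + "=\"([^\"]*)\"");
        Matcher m = p.matcher(attributes);
        return m.find() ? m.group(1) : "";
    }

    // Returns one "title -> href" entry for every anchor found in the markup.
    public static List<String> titlesAndLinks(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            String attrs = m.group(1);
            result.add(attribute(attrs, "title") + " -> " + attribute(attrs, "href"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened version of the first anchor from the question's output.
        String html = "<a class=\"link\" title=\"Waf Ad\" href=\"https://www.facebook.com/waf.ad.54\" "
                + "data-jsid=\"anchor\" target=\"_blank\"><img class=\"_s0 _rw img\" alt=\"Waf Ad\" /></a>";
        for (String entry : titlesAndLinks(html)) {
            System.out.println(entry);
        }
    }
}
```

Splitting the work into two patterns (one to find the tag, one to find an attribute inside it) keeps the result independent of the order in which title and href appear in the tag.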

