Extracting content from HTML tags using Java
I fetched an HTML page and parsed out the anchor tags shown below. I then tried several approaches, such as taking substrings, to extract only the title and href attributes, but none of them work. Can anyone help me? Below are my code and a small snippet of its output.
My code:
doc = Jsoup.connect("myurl").get();
Elements link = doc.select("a[href]");
String stringLink = null;
for (int i = 0; i < link.size(); i++)
{
stringLink = link.toString();
System.out.println(stringLink);
}
Output:
<a class="link" title="Waf Ad" href="https://www.facebook.com/waf.ad.54"
data-jsid="anchor" target="_blank"><img class="_s0 _rw img" src="https:
//fbcdn-profile-a.akamaihd.net/hprofile-ak-ash1/t5/186729_100007938933785_
508764241_q.jpg" alt="Waf Ad" data-jsid="img" /></a>
<a class="link" title="Ana Ga" href="https://www.facebook.com/ata.ga.31392410"
data-jsid="anchor" target="_blank"><img class="_s0 _rw img" src="https://
fbcdn-profile-a.akamaihd.net/hprofile-ak-ash1/t5/186901_100002334679352_
162381693_q.jpg" alt="Ana Ga" data-jsid="img" /></a>
You can use the attr() method of the Element class to extract the value of an attribute.
For example:
String href = link.attr("href");
String title = link.attr("title");
For more, see the jsoup cookbook page "Extract attributes, text, and HTML from elements".
To get the page title, you can use:
Document doc = Jsoup.connect("myurl").get();
String title = doc.title();
To get the individual links from the different href attributes, you can use:
Elements links = doc.select("a[href]");
for (Element ele : links) {
    // attr() already returns a String, so no toString() call is needed
    System.out.println(ele.attr("href"));
}
The attr() method returns the value of the specified attribute on the given tag.
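Putting the answers above together, here is a minimal sketch of pulling both the title and href attributes out of each anchor. It assumes jsoup is on the classpath and parses an inline HTML string instead of fetching a URL, so it runs offline. The class name AnchorAttributes, the helper titlesAndLinks, and the shortened sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class AnchorAttributes {
    // Hypothetical helper: returns one "title -> href" string per <a> that has an href.
    public static List<String> titlesAndLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        // Iterate over each Element; calling toString() on the whole
        // Elements collection would print every tag on every pass.
        for (Element a : doc.select("a[href]")) {
            result.add(a.attr("title") + " -> " + a.attr("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened version of the markup from the question (assumed sample data).
        String html =
            "<a class=\"link\" title=\"Waf Ad\" href=\"https://www.facebook.com/waf.ad.54\">Waf Ad</a>"
          + "<a class=\"link\" title=\"Ana Ga\" href=\"https://www.facebook.com/ata.ga.31392410\">Ana Ga</a>";
        for (String line : titlesAndLinks(html)) {
            System.out.println(line);
        }
    }
}
```

The loop in the question prints full tags because it calls toString() on the whole collection; reading attributes from each Element, as here, yields only the values.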
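import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Solution {
    public static void main(String[] args) {
        Scanner scan = new Scanner(System.in);
        // The first input line holds the number of lines to process.
        int testCases = Integer.parseInt(scan.nextLine());
        // Group 1 is the tag name, group 2 is the text between the tags;
        // \1 requires the closing tag to repeat the opening tag exactly.
        Pattern r = Pattern.compile("<(.+)>([^<]+)</\\1>");
        while (testCases-- > 0) {
            String line = scan.nextLine();
            boolean matchFound = false;
            Matcher m = r.matcher(line);
            while (m.find()) {
                System.out.println(m.group(2));
                matchFound = true;
            }
            if (!matchFound) {
                System.out.println("None");
            }
        }
    }
}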
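REGULAR EXPRESSION EXPLANATION:
<(.+)> matches an opening tag and captures everything between the angle brackets (the tag name) as group 1.
([^<]+) captures the content as group 2: one or more characters other than "<", so the content cannot contain a nested tag.
</\1> matches the closing tag; the backreference \1 must repeat the text captured by group 1 exactly, including case. In a Java string literal it is written as \\1.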
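As a quick check of how this pattern behaves, the following sketch wraps it in a helper that returns the matched contents as a list instead of reading from standard input. The class name TagContentDemo, the helper extractContents, and the sample strings are assumptions made for illustration; only java.util classes are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagContentDemo {
    // Same pattern as in the answer above.
    private static final Pattern TAG = Pattern.compile("<(.+)>([^<]+)</\\1>");

    // Hypothetical helper: collects group 2 (the tag content) of every match in the line.
    public static List<String> extractContents(String line) {
        List<String> contents = new ArrayList<>();
        Matcher m = TAG.matcher(line);
        while (m.find()) {
            contents.add(m.group(2));
        }
        return contents;
    }

    public static void main(String[] args) {
        // Two well-formed tags on one line: both contents are found.
        System.out.println(extractContents("<h1>Hello</h1><p>World</p>")); // prints [Hello, World]
        // The closing tag does not repeat the opening tag, so nothing matches.
        System.out.println(extractContents("<b>mismatch</i>")); // prints []
    }
}
```

This kind of pattern suits simple, line-oriented input like the example above; for real pages with attributes and nesting, an HTML parser such as jsoup is usually the more robust choice.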