Why are the HTML shown in Chrome DevTools and the HTML parsed by jsoup different?
I'm trying to extract the created date of issues from the HADOOP Jira issue site ( https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues ).
As you can see in this screenshot, the created date is the text inside the time tag whose class is livestamp (e.g. <time class=livestamp ...> 'this text' </time>).
So I tried to parse it with the code below.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CreatedDateExtractor {
    public static void main(String[] args) {
        String url = "https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues";
        Document doc = null;
        try {
            doc = Jsoup.connect(url).get();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        // Select every <time> element that has the livestamp class
        Elements elements = doc.select("time.livestamp");
        System.out.println("# of elements : " + elements.size());
        for (Element e : elements) {
            System.out.println(e.text());
        }
    }
}
I expected the created date to be extracted, but the actual output is # of elements : 0 .
That looked wrong to me, so I tried to dump the whole HTML from that site with the code below.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CreatedDateExtractor {
    public static void main(String[] args) {
        String url = "https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues";
        Document doc = null;
        try {
            doc = Jsoup.connect(url).get();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        // Select every element in the HTML document
        Elements elements = doc.select("*");
        System.out.println("# of elements : " + elements.size());
        for (Element e : elements) {
            System.out.println(e);
        }
    }
}
I compared the HTML shown in Chrome DevTools with the HTML I parsed, element by element, and found that they are different.
Can you explain why this happens and give me some advice on how to extract the created date?
I advise you to get the elements with the "time" tag first, and then use select to pick the one that has the "livestamp" class. Here is an example:
Elements timeTags = doc.select("time");
Element timeLivestamp = null;
for (Element tag : timeTags) {
    Element livestamp = tag.selectFirst(".livestamp");
    if (livestamp != null) {
        timeLivestamp = livestamp;
        break;
    }
}
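On static markup that already contains the element, this loop and the compound selector time.livestamp would be expected to pick the same element. Here is a minimal sketch to check that locally, assuming jsoup is on the classpath. The class name SelectorCheck and the sample HTML string are made up for illustration and are not taken from the Jira page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorCheck {
    // Hypothetical sample markup, NOT copied from the Jira page.
    static final String SAMPLE =
            "<div><time class=\"other\">ignored</time>"
            + "<time class=\"livestamp\">example created date</time></div>";

    // Same idea as the loop above: first <time> element that matches .livestamp
    static String viaLoop(Document doc) {
        for (Element tag : doc.select("time")) {
            Element livestamp = tag.selectFirst(".livestamp");
            if (livestamp != null) {
                return livestamp.text();
            }
        }
        return null;
    }

    // Compound selector: tag name and class in a single query
    static String viaCompound(Document doc) {
        Element e = doc.selectFirst("time.livestamp");
        return e == null ? null : e.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE);
        System.out.println("loop:     " + viaLoop(doc));
        System.out.println("compound: " + viaCompound(doc));
    }
}
```

If both lines print the same text for the sample but the real page still yields nothing, the selector is probably not the cause and the element is likely missing from the HTML that jsoup downloaded.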
I don't know why, but when I use Jsoup's .select() method with more than one selector (as you did with time.livestamp), I get interesting outputs like this.
import Torello.HTML.*;
import Torello.HTML.NodeSearch.*;
import Torello.Java.*;
import java.util.*;
import java.io.*;
import java.net.*;

public class Scrape
{
    public static void main(String[] argv) throws IOException
    {
        // This URL does not appear to have an HTML Element with a "TimeStamp" as you have stated.
        // ==> Go to any browser and view it for yourself! (Click "View Source" in Google-Chrome, I.E., Safari, etc...)
        // URL url = new URL("https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues");
        URL url = new URL("https://some.url.org/");

        // This scrapes the web-page into a standard Java-Vector.
        // HTMLNode is abstract, and has only 2 classes that inherit it. (3 actually, but one is the "CommentNode")
        Vector<HTMLNode> page = HTMLPage.getPageTokens(url, false);

        // This will output each & every node in the page to a text/html file called "output.html"
        // Read Documentation Files for "Util.pageToString" and "FileRW.writeFile"
        FileRW.writeFile(Util.pageToString(page), "output.html");

        // If this is the question to identify:
        // As you can see in this Screenshot, created date is the text between the time tag whose class is
        // live stamp (e.g. <time class=livestamp ...> 'this text' </time>)
        //
        // Using the "NodeSearch.InnerTagGetInclusive" class will retrieve the information you need
        Vector<HTMLNode> liveStamp = InnerTagGetInclusive.first(page, "time", "class", TextComparitor.CN_CI, "livestamp");

        // This will eliminate all the "TagNode" elements when building this String.
        // It will leave you with only the "TextNode" elements.
        // The remaining TextNodes should, indeed, be the "this text" as a string.
        String liveStampStr = Util.textNodesString(liveStamp);
        System.out.println("Live-Stamp: " + liveStampStr);
    }
}
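The "View Source" check suggested in the comments above can also be done in code. jsoup does not execute JavaScript, so it only sees the HTML the server sends, while DevTools shows the DOM after scripts have run. The sketch below uses only the Java standard library. The class name StaticVsRendered, the helper hasLivestampTag and both sample strings are hypothetical stand-ins, not the real Jira markup, and a regex is used only as a quick presence check, not as an HTML parser.

```java
import java.util.regex.Pattern;

public class StaticVsRendered {
    // Matches an opening <time ...> tag whose class attribute contains "livestamp".
    private static final Pattern LIVESTAMP_TAG = Pattern.compile(
            "<time\\b[^>]*\\bclass\\s*=\\s*[\"']?[^\"'>]*\\blivestamp\\b",
            Pattern.CASE_INSENSITIVE);

    // Returns true if the given HTML text contains such a tag.
    static boolean hasLivestampTag(String html) {
        return LIVESTAMP_TAG.matcher(html).find();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for what "View Source" (and jsoup) would see
        String pageSource =
                "<html><body><div id=\"app\"></div>"
                + "<script src=\"app.js\"></script></body></html>";

        // Hypothetical stand-in for what the DevTools Elements panel would show
        String renderedDom =
                "<html><body><div id=\"app\">"
                + "<time class=\"livestamp\">example created date</time>"
                + "</div></body></html>";

        System.out.println("page source has tag:  " + hasLivestampTag(pageSource));
        System.out.println("rendered DOM has tag: " + hasLivestampTag(renderedDom));
    }
}
```

To try it on the real page, pass in the string printed by the question's second program (the full jsoup dump) and the HTML copied from the DevTools Elements panel. If only the DevTools copy contains the tag, the element is most likely inserted by a script after the page loads, which would explain the # of elements : 0 result.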