How to get all HTML links and store them in a variable with Jsoup
How can I get the a[href] link of the value selected in the list that crawler() fills through listModel, and use that link in the Jsoup.connect() call in openie()?
Here is my code:
private void btnExtractActionPerformed(java.awt.event.ActionEvent evt) {
    try {
        openie();
    } catch (IOException ex) {
        Logger.getLogger(MainUI.class.getName()).log(Level.SEVERE, null, ex);
    }
}

private void btnSearchActionPerformed(java.awt.event.ActionEvent evt) {
    try {
        crawler();
    } catch (IOException ex) {
        Logger.getLogger(MainUI.class.getName()).log(Level.SEVERE, null, ex);
    }
}

private String subject, object, link;

private void crawler() throws IOException {
    // replace space with "+"
    String input = txtSearch.getText().replace(" ", "+");
    int count = 0;
    DefaultListModel<String> listModel = new DefaultListModel<>();
    while (count <= 20) {
        String url = "https://www.google.com/search?q=" + input + "&tbm=nws&source=lnm&start=" + count;
        Document doc = Jsoup.connect(url).userAgent("Mozilla").timeout(10000).get();
        Elements e = doc.select("div.g a[href]");
        for (Element e1 : e) {
            listModel.addElement(e1.text());
            link = e1.absUrl("href");
        }
        count += 10;
    }
    newsList.setModel(listModel);
}

private void openie() throws IOException {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    List<String> list = newsList.getSelectedValuesList();
    Document d = Jsoup.connect(link).userAgent("Mozilla").timeout(10000).get();
    DefaultListModel<String> extractedList = new DefaultListModel<>();
    for (String selected : list) {
        extractedList.addElement(selected);
        selected = d.body().text();
        Annotation doc = new Annotation(selected);
        pipeline.annotate(doc);
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            // Get the OpenIE triples for the sentence
            Collection<RelationTriple> triples
                    = sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
            // Print the triples
            for (RelationTriple triple : triples) {
                subject = triple.subjectLemmaGloss().replace(" ", "_").toLowerCase();
                object = triple.objectLemmaGloss().replace(" ", "_").toLowerCase();
                extractedList.addElement(triple.confidence + "\t"
                        + triple.subjectLemmaGloss() + "\t"
                        + triple.relationLemmaGloss() + "\t"
                        + triple.objectLemmaGloss());
            }
        }
    }
    tuplesList.setModel(extractedList);
}
With my code, I parse a URL and am able to get the a[href] values. However, link ends up holding only the a[href] of the last element that was traversed. I am unable to store the a[href] of each element that is retrieved.
In crawler(), link is reassigned on every loop iteration, so it only ever holds the last URL. You need a List variable to store the URLs, as shown below.
Elements links = doc.select("div.g a[href]");
List<String> urls = links.stream().map(i -> i.absUrl("href")).collect(Collectors.toList());
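To use the stored URLs in openie(), each URL has to be matched back to the list entry the user selected. One possible approach, sketched below, is to keep the URLs in a field at the same index as the corresponding titles in listModel and look them up with newsList.getSelectedIndices(). The class name LinkStore and its methods add and urlsFor are hypothetical names introduced for this illustration, and the example.com addresses are placeholders. The sketch uses only the standard library, so the Jsoup and Swing calls appear only in comments.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: keeps each URL at the same index as its title in the list model. */
public class LinkStore {
    private final List<String> urls = new ArrayList<>();

    /** Call once per element in the crawler loop, right after listModel.addElement(e1.text()). */
    public void add(String url) {
        urls.add(url);
    }

    /** Call at the start of a new search so indices line up with the fresh list model. */
    public void clear() {
        urls.clear();
    }

    /** Pass newsList.getSelectedIndices() here; out-of-range indices are skipped. */
    public List<String> urlsFor(int[] selectedIndices) {
        List<String> selected = new ArrayList<>();
        for (int i : selectedIndices) {
            if (i >= 0 && i < urls.size()) {
                selected.add(urls.get(i));
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        LinkStore store = new LinkStore();
        store.add("https://example.com/a"); // placeholder URLs
        store.add("https://example.com/b");
        store.add("https://example.com/c");
        // Simulates the user selecting the first and third rows of the list.
        System.out.println(store.urlsFor(new int[] {0, 2}));
        // prints [https://example.com/a, https://example.com/c]
    }
}
```

In openie(), the Jsoup.connect(...) call would then move inside a loop over the returned URLs, so that each selected article is fetched from its own link instead of from the shared link field.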