
Extract the thread head and thread reply from a forum

I want to extract only the users' views and replies and the title of the thread head from a forum. With the code below, when you supply a URL, it returns everything on the page. I only want the thread heading, which is defined in the title tag, and the user replies, which sit inside the div content tags. How can I extract just those? Please also explain how to write the result to a .txt file.

package extract;

import java.io.*;

import org.jsoup.*;
import org.jsoup.nodes.*;

public class TestJsoup
{
    public void SimpleParse()
    {
        try
        {
            Document doc = Jsoup.connect("url").get();

            doc.body().wrap("<div></div>");
            doc.body().wrap("<pre></pre>");

            String text = doc.text();

            // Converting nbsp entities
            text = text.replaceAll("\u00A0", " ");

            System.out.print(text);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    public static void main(String args[])
    {
        TestJsoup tjs = new TestJsoup();
        tjs.SimpleParse();
    }
}

Why do you wrap the body element in a div and a pre tag?

The title element can be selected like this:

Document doc = Jsoup.connect("url").get();

Element titleElement = doc.select("title").first();
String titleText = titleElement.text();

// Or shorter (use this instead of the two lines above):
// String titleText = doc.select("title").first().text();

Div tags:

// Document 'doc' as above

Elements divTags = doc.select("div");


for( Element element : divTags )
{
    // Do something here, e.g. print each element
    System.out.println(element);

    // Or get the Text of it
    String text = element.text();
}

Here's an overview of the whole Jsoup Selector API, which will help you find any kind of element you need.
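Selecting every div returns far more than the replies, so a narrower selector is usually what you want. The sketch below is only an illustration under assumptions: it guesses that each reply sits in a div with the class "content" (check the real page source and adjust the selector), and it parses an inline HTML string in place of Jsoup.connect("url").get() so it can run offline. ThreadExtractor and extractReplies are hypothetical names, not part of jsoup.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ThreadExtractor {

    // Returns the text of every <div class="content"> element, in document order.
    // The class name "content" is an assumption about the forum's markup.
    static List<String> extractReplies(Document doc) {
        List<String> replies = new ArrayList<>();
        for (Element reply : doc.select("div.content")) {
            replies.add(reply.text());
        }
        return replies;
    }

    public static void main(String[] args) {
        // Inline HTML stands in for Jsoup.connect("url").get() so the sketch runs offline.
        String html = "<html><head><title>Sample thread</title></head><body>"
                + "<div class=\"header\">Site banner</div>"
                + "<div class=\"content\">First reply</div>"
                + "<div class=\"content\">Second reply</div>"
                + "</body></html>";
        Document doc = Jsoup.parse(html);

        // Document.title() returns the text of the <title> element.
        System.out.println(doc.title());
        for (String reply : extractReplies(doc)) {
            System.out.println(reply);
        }
    }
}
```

If the replies are marked with an id or a different class, only the selector string needs to change, for example "div#content" for an id.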

Well, I used different code and collected the data from these specific tags.

Elements content = doc.getElementsByTag("blockquote");
Elements k = doc.select("[postcontent restore]");

content.select("blockquote").remove();
content.select("br").remove();
content.select("div").remove();
content.select("a").remove();
content.select("b").remove();
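The question also asks how to write the result to a text file, which none of the snippets in this thread covers. Here is a minimal sketch using only the standard library, assuming the title and the reply texts have already been extracted into a String and a List<String> (for example with doc.select("title") as shown earlier). ThreadWriter, toLines, writeThread and the file name thread.txt are made up for this illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ThreadWriter {

    // Builds the lines to write: the title first, a blank line, then one line per reply.
    static List<String> toLines(String title, List<String> replies) {
        List<String> lines = new ArrayList<>();
        lines.add(title);
        lines.add("");
        lines.addAll(replies);
        return lines;
    }

    // Writes the lines as UTF-8, creating the file or overwriting an existing one.
    static void writeThread(Path target, String title, List<String> replies) throws IOException {
        Files.write(target, toLines(title, replies), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder values; in practice these would come from the jsoup selection.
        String title = "Sample thread";
        List<String> replies = Arrays.asList("First reply", "Second reply");
        writeThread(Paths.get("thread.txt"), title, replies);
    }
}
```

Files.write with a list of lines adds the line separator after each entry, so there is no need to build one large string by hand. To append to an existing file instead of overwriting it, StandardOpenOption.CREATE and StandardOpenOption.APPEND can be passed as extra arguments.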
