
Text Extraction from HTML Java

I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file.

I want to extract the text between the paragraph tags, but I can only get one line of each paragraph. My code is as follows:

FileReader fileReader = new FileReader(file);
BufferedReader buffRd = new BufferedReader(fileReader);
BufferedWriter out = new BufferedWriter(new FileWriter("newFile.txt"));
String s;

while ((s = buffRd.readLine()) != null) {
    if (s.contains("<p>")) {
        try {
            out.write(s);  // only the line containing <p> is written
        } catch (IOException e) {
        }
    }
}

I tried adding another while loop that would tell the program to keep writing to the file until the line contains the </p> tag:

while ((s = buffRd.readLine()) != null) {
    if (s.contains("<p>")) {
        while (!s.contains("</p>")) {
            try {
                out.write(s);
            } catch (IOException e) {
            }
        }
    }
}

But this doesn't work. Could someone please help?


Another HTML parser I really like using is jsoup. You can get all the <p> elements in two lines of code:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements ps = doc.select("p");

Then write it out to a file in one more line:

out.write(ps.text());  // joins the text of all the <p> elements into one long string

Or, if you want them on separate lines, you can iterate through the elements and write them out separately.



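Try this (if you don't want to use an HTML parser library):

        // Assumes `file` is the downloaded HTML file and that IOException is handled by the caller.
        BufferedReader buffRd = new BufferedReader(new FileReader(file));
        BufferedWriter out = new BufferedWriter(new FileWriter("newFile.txt"));
        String s;
        boolean writeTo = false;    // true while we are between <p> and </p>
        while ((s = buffRd.readLine()) != null) {
            if (s.contains("<p>")) {
                writeTo = true;     // a paragraph starts on this line
            }
            if (writeTo) {
                out.write(s);       // write the whole line, tags included
                out.newLine();
            }
            if (s.contains("</p>")) {
                writeTo = false;    // the paragraph ends on this line
            }
        }
        buffRd.close();
        out.close();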


I've had success using TagSoup & XPath to parse HTML.
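
In case XPath is unfamiliar, here is a rough sketch of the XPath half of that approach using only classes that ship with the JDK. It assumes the input is already well-formed XHTML, which real-world HTML usually is not; cleaning up messy HTML so that XML tools can read it is the part TagSoup would handle, and TagSoup itself is not used here. The class name XPathParagraphs, the method name and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: the name and structure are illustrative only.
public class XPathParagraphs {

    // Returns the text content of every <p> element in a well-formed XHTML string.
    // Throws IllegalArgumentException if the input cannot be parsed as XML.
    public static List<String> paragraphs(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // "//p" selects all <p> elements anywhere in the document
            // (assumes the markup declares no XML namespace).
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//p", doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() also includes text of nested tags such as <b>.
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<html><body><p>First</p><div><p>Second</p></div></body></html>";
        for (String p : paragraphs(sample)) {
            System.out.println(p);
        }
    }
}
```

Because the text is taken from the parsed tree rather than from raw lines, a paragraph that spans several lines of the file still comes back as a single string.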


Use a ParserCallback. It's a simple class that's included with the JDK. It notifies you every time a new tag is found, and then you can extract the text of the tag. A simple example:

import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class ParserCallbackTest extends HTMLEditorKit.ParserCallback {
    private int tabLevel = 1;   // current nesting depth, used for indentation
    private int line = 1;

    public void handleComment(char[] data, int pos) {
        displayData(new String(data));
    }

    public void handleEndOfLineString(String eol) {
        System.out.println( line++ );
    }

    public void handleEndTag(HTML.Tag tag, int pos) {
        tabLevel--;             // leaving an element: indent one level less
        displayData("/" + tag);
    }

    public void handleError(String errorMsg, int pos) {
        displayData(pos + ":" + errorMsg);
    }

    public void handleMutableTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
        displayData("mutable:" + tag + ": " + pos + ": " + a);
    }

    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
        displayData( tag + "::" + a );
//      tabLevel++;
    }

    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
        displayData( tag + ":" + a );
        tabLevel++;             // entering an element: indent one level more
    }

    public void handleText(char[] data, int pos) {
        displayData( new String(data) );
    }

    // Prints the text indented by one tab per nesting level.
    private void displayData(String text) {
        for (int i = 0; i < tabLevel; i++) {
            System.out.print("\t");
        }
        System.out.println(text);
    }

    public static void main(String[] args)
    throws IOException {
        ParserCallbackTest parser = new ParserCallbackTest();

        // args[0] is the file to parse

        Reader reader = new FileReader(args[0]);
//      URLConnection conn = new URL(args[0]).openConnection();
//      Reader reader = new InputStreamReader(conn.getInputStream());

        try {
            new ParserDelegator().parse(reader, parser, true);
        } catch (IOException e) {
            System.out.println(e);
        }
    }
}

So all you need to do is set a boolean flag when the paragraph tag is found. Then in the handleText() method you extract the text.
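
A minimal sketch of that flag idea, in case it helps. The class name ParagraphCollector, the inParagraph field and the extract() helpers are names I made up for illustration; only HTMLEditorKit.ParserCallback, HTML.Tag.P and ParserDelegator come from the JDK. It collects the text of each paragraph into a list instead of printing it, and it assumes paragraphs are not nested.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text found between <p> and </p>.
public class ParagraphCollector extends HTMLEditorKit.ParserCallback {
    private final List<String> paragraphs = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private boolean inParagraph = false;   // the flag: true while inside a <p>

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
        if (tag == HTML.Tag.P) {
            inParagraph = true;
            current.setLength(0);          // start a fresh paragraph
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inParagraph) {
            current.append(data);          // keep text only while the flag is set
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.P && inParagraph) {
            inParagraph = false;
            paragraphs.add(current.toString().trim());
        }
    }

    // Parses HTML from any Reader (for example a FileReader) and returns the paragraph texts.
    public static List<String> extract(Reader reader) throws IOException {
        ParagraphCollector collector = new ParagraphCollector();
        new ParserDelegator().parse(reader, collector, true);
        return collector.paragraphs;
    }

    // Convenience overload for HTML that is already in memory.
    public static List<String> extract(String html) {
        try {
            return extract(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

To write the result to a file, you could loop over ParagraphCollector.extract(new FileReader(file)) and call out.write(...) followed by out.newLine() for each entry.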

Try this.

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// The class name is arbitrary; jsoup must be on the classpath.
public class PrintParagraphs {
    public static void main(String[] args) {
        String url = "http://en.wikipedia.org/wiki/Big_data";

        try {
            Document document = Jsoup.connect(url).get();
            Elements paragraphs = document.select("p");

            // Print the text of every <p> element, one paragraph per line.
            for (Element p : paragraphs) {
                System.out.println("*  " + p.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}


perl -ne "print if m|<p>| .. m|</p>|" infile.txt >outfile.txt

