How to parse HTML and keep ALL line breaks?

I have a document that contains <br/>, <p>, and <table> elements.

I have been trying to parse this HTML with Jsoup while preserving the line breaks.

I tried many methods from similar questions, but none of them worked.

    FileInputStream in = new FileInputStream("C:............xxx.htm");
    String htmlText = IOUtils.toString(in);

    File file = new File("C:............xxx.txt");
    PrintWriter pr = new PrintWriter(file);

    // Mark every <br> with a placeholder, strip the tags, then turn the placeholders into newlines
    String text = Jsoup.parse(htmlText.replaceAll("(?i)<br[^>]*>", "br2n")).text();
    System.out.println(text.replaceAll("br2n", "\n"));
    pr.println(text.replaceAll("br2n", "\n"));

    // Also tried: parsing the file line by line
    // for (String line : htmlText.split("\n")) {
    //     String stripped = Jsoup.parse(line).text();
    //     System.out.println(stripped);
    //     pr.println(stripped);
    // }

    pr.close();


Here is a representative part of my HTML file (the original file starts with <html>, of course):

    <table border="0" cellspacing="0" cellpadding="0" bgcolor="white"
    width='650'>
    <tr>
    <td><font size="4"><br />
    &nbsp;<b>The scientific explantion of the syndrom</b></font>
    <table width='650' border="0" cellspacing="5" cellpadding="0">
    <tr>
    <td width='5%'>&nbsp;</td>
    <td width='25%'>&nbsp;</td>
    <td width='25%'>&nbsp;</td>
    <td width='15%'>&nbsp;</td>
    <td width='30%'>&nbsp;</td>
    </tr>
    <tr height="24">
    <td align="left" nowrap="nowrap" colspan="3"><font size=
    "3"><b>Recent Update</b></font></td>
    <td align="left" nowrap="nowrap"><a name=
    "9J003346248"></a><font size="3"><b>Issue:</b></font></td>
    <td align="left"><font size="3">9569865248</font></td>
    </tr>
    <tr>
    <td>&nbsp;</td>
    <td align="left"><b>Locust:</b></td>
    <td align="left" colspan="3">UYF78UIGK</td>
    </tr>

    </table>

    <br/> The explanation above does not necc....... <p> 
    Blah ....
    </p>

    <table border="2" cellspacing="1" cellpadding="0" bgcolor="white"
    width='750'>
    <tr>
    <td><font size="4"><br />
    &nbsp;<b>Syndrom of the main ......</b></font>
    <table width='650' border="0" cellspacing="5" cellpadding="0">
    <tr>
    <td width='5%'>&nbsp;</td>
    <td width='25%'>&nbsp;</td>
    <td width='25%'>&nbsp;</td>
    <td width='15%'>&nbsp;</td>
    <td width='30%'>&nbsp;</td>
    </tr>
    <tr height="24">
    <td align="left" nowrap="nowrap" colspan="3"><font size=
    "3"><b>Data</b></font></td>
    <td align="left" nowrap="nowrap"><a name=
    "9J003346248"></a><font size="3"><b>Issue:</b></font></td>
    <td align="left"><font size="3">9509809248</font></td>
    </tr>
    <tr>
    <td>&nbsp;</td>
    <td align="left"><b>Locust:</b></td>
    <td align="left" colspan="3">U344365GK</td>
    </tr>

    </table>

    <br/> The explanation above does not necc....... <p> 
    Blah ....
    </p>

I need to make sure that all rows in those tables appear one after another, each on its own line, the way they do in the original document. But I have multiple tables and other "line-breaking" elements. How can I do this with Jsoup? Or is there another API that can parse HTML and keep the line breaks more effectively?

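You almost had it right. Add placeholders for the other line-breaking elements as well, and make sure every replaceAll runs on the HTML string before it is passed to Jsoup.parse:

    String text = Jsoup.parse(htmlText
            .replaceAll("(?i)</tr>", "</tr> br2n ")
            .replaceAll("(?i)<br[^>]*>", "br2n")
            .replaceAll("(?i)<p>", "<p> br2n ")
            .replaceAll("(?i)</p>", "</p> br2n "))
            .text();
    System.out.println(text.replaceAll("br2n", "\n"));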
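The placeholder idea can also be tried without any parser, which makes it easy to check in isolation. The following is only a minimal sketch under stated assumptions, not Jsoup's behavior: the class `LineBreakSketch` and the method `htmlToText` are hypothetical names, tags are stripped with a naive regex as a crude stand-in for `Jsoup.parse(...).text()`, and the marker `br2n` is assumed never to occur in the real text. Working on the raw string also sidesteps one possible pitfall: an HTML5 parser may move text that sits directly between table rows in front of the table, which would misplace markers inserted after `</tr>`.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class LineBreakSketch {
    // Hypothetical placeholder token; assumed not to appear in the document text.
    private static final String MARK = "br2n";

    // Converts HTML to plain text, starting a new line at every <br>,
    // at every </tr>, and at every <p> or </p>.
    static String htmlToText(String html) {
        String marked = html
                .replaceAll("(?i)<br\\b[^>]*>", MARK)     // <br>, <br/>, <br />
                .replaceAll("(?i)</tr\\s*>", MARK)        // end of a table row
                .replaceAll("(?i)</?p\\b[^>]*>", MARK);   // <p ...> and </p>, but not <pre>

        String flat = marked
                .replaceAll("<[^>]*>", " ")   // naive tag stripping: every tag becomes a space
                .replace("&nbsp;", " ")       // treat non-breaking spaces as plain spaces
                .replaceAll("\\s+", " ");     // collapse runs of whitespace, including source newlines

        // Split on the marker (keeping empty segments), trim each line, rejoin with newlines.
        return Arrays.stream(flat.split(Pattern.quote(MARK), -1))
                .map(String::trim)
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String sample = "<table><tr><td><b>Locust:</b></td><td>UYF78UIGK</td></tr>"
                + "<tr><td>Issue:</td><td>9569865248</td></tr></table>"
                + "<br/>Some note<p>Blah</p>";
        System.out.println(htmlToText(sample));
    }
}
```

Calling `LineBreakSketch.htmlToText(htmlText)` gives one output line per table row, `<br>`, and paragraph boundary. Because every tag is replaced by a space, words split by inline tags such as `<b>` will pick up an extra space, which a real parser would not add.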
