
Using Jsoup to fetch data from HTML between two <br> tags

I am working on a personal project and want to parse this HTML and retrieve information from it.

Basically, I want to get all the information that appears between the 'br' tags. For this I am using Jsoup in Java.

I want to store these values as pairs in a map (key, value).

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <style> </style>
  </head>
  <body lang="EN-US" link="#0563C1" vlink="#954F72" style="">
    <div class="WordSection1">
      <p class="MsoNormal">
        <span style=""></span>
      </p>
      <p class="MsoNormal">
        <span style=""></span>
      </p>
      <div>
        <div style="border:none; border-top:solid #E1E1E1 1.0pt; padding:3.0pt 0in 0in 0in">
          <p class="MsoNormal">
            <a name="_MailOriginal">
              <b>
                <span style="">From: </span>
              </b>
            </a>
            <span style="">
              <span style=""> ABC (membership@abc.org)
                  <br>
                  <b>Sent: </b> Tuesday, November 24, 2020 8:13 AM <br>
                  <b>To: </b> XYZ <XYZ@abc.com>
                    <br>
                    <b>Subject: </b> Information Request </span>
            </span>
          </p>
        </div>
      </div>
      <p class="MsoNormal">
        <span style=""></span>
      </p>
      <table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" align="left" width="100%" style="width:100.0%">
        <tbody>
          <tr style="">
            <td style="background:#910A19; padding:5.25pt 1.5pt 5.25pt 1.5pt">
              <span style=""></span>
            </td>
            <span style=""></span>
            <td width="100%">
              <div>
                <p class="MsoNormal" style="">
                  <span style="">
                    <b>
                      <span style="font-size:12.0pt; font-family:" ` Calibri (Body)`",serif; color:#212121">EXTERNAL EMAIL: Beware of Phishing attacks! </span>
                    </b>
                  </span>
                </p>
              </div>
            </td>
            <span style=""></span>
          </tr>
        </tbody>
      </table>
      <table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" width="100%" style="width:100.0%; background:#B2B2B2">
        <tbody>
          <tr style="">
            <td style="padding:25.0pt 25.0pt 25.0pt 25.0pt">
              <div align="center">
                <table class="MsoNormalTable" border="1" cellspacing="0" cellpadding="0" width="600" style="width:6.25in; background:white; border:solid black 1.0pt">
                  <tbody>
                    <tr style="">
                      <td style="border:none; padding:2.0pt 2.0pt 2.0pt 2.0pt">
                        <table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" style="">
                          <tbody>
                            <tr style="">
                              <td style="border:none; border-bottom:solid #CDCDCD 1.0pt; padding:7.5pt 3.75pt 7.5pt 3.75pt">
                                <p class="MsoNormal">
                                  <span style="">
                                    <span style="border:solid windowtext 1.0pt; padding:0in">
                                      <img width="100" height="100" id="_x0000_i1025" src="cid:~WRD2635.jpg" alt="Image removed by sender.">
                                    </span>
                                  </span>
                                  <span style="">
                                    <span style=""></span>
                                  </span>
                                </p>
                              </td>
                              <span style=""></span>
                              <td width="100%" style="width:100.0%; border:none; border-bottom:solid #CDCDCD 1.0pt; padding:7.5pt 3.75pt 7.5pt 3.75pt">
                                <p class="MsoNormal">
                                  <span style="">
                                    <b>
                                      <span style="font-size:18.0pt; font-family:" Arial",sans-serif">AWSCV </span>
                                    </b>
                                  </span>
                                </p>
                              </td>
                              <span style=""></span>
                            </tr>
                          </tbody>
                        </table>
                        <span style=""></span>
                      </td>
                      <span style=""></span>
                    </tr>
                    <tr style="">
                      <td style="border:none; padding:2.0pt 2.0pt 2.0pt 2.0pt">
                        <table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" style="">
                          <tbody>
                            <tr style="">
                              <td style="border:none; border-bottom:solid #CDCDCD 1.0pt; padding:7.5pt 7.5pt 7.5pt 7.5pt">
                                <p class="MsoNormal">
                                  <span style="">
                                    <span style="font-size:9.0pt; font-family:" Arial",sans-serif">Dear XYZ, <br>
                                      <br>The following Information Request form was submitted by ABC, Company: asd, Email: asd@abc.com on 11/23/2020. <br>
                                      <br>Information: <br>
                                      <br>Legal Business Name <br>Asfdsf <br>
                                      <br>Phone <br>(718) 43543 <br>
                                      <br>Principle Name 1 <br>afdsgsfgsg df <br>
                                      <br>EIN <br>04543 <br>
                                      <br>Bus Street Address <br>fdgfdgfdg <br>
                                      <br>Bus City <br>fgfdvgdsgs <br>
                                      <br>Bus State <br>dsf <br>
                                      <br>Bus Zip Code <br>34543534 <br>
                                      <br>Email Address <br>abc@gamil.com <br>
                                      <br>Secondary Email Address <br>abc@gamil.com <br>
                                      <br>Business Website Address <br>NOEMAIL.COM <br>
                                      <br>DBA info same as Business <br>
                                      <br>DBA information is same as Business. <br>
                                      <br>DBA Name <br>Asfdsf <br>
                                      <br>DBA Street Address <br>sgfdgfdg435435 34 <br>
                                      <br>DBA City <br>ACDCROCK <br>
                                      <br>DBA State <br>AT <br>
                                      <br>DBA Zip Code <br>324324 <br>
                                      <br>DBA Phone <br>(458) 43543543 <br>
                                      <br>DBA Email Address <br>abc@gamil.com <br><br>Secondary DBA Email Address <br>--- No answer --- <br><br>Tertiary DBA Email Address <br>--- No answer --- <br><br>DBA Website Address <br>NOEMAIL.COM <br><br>Secondary DBA Website Address <br>--- No answer --- <br><br>Tertiary DBA Website Address <br>--- No answer --- <br><br>Information Request Text <br>Any information would be helpful <br><br> Description <br>ACCESSORIES <br><br>wegf <br>4545 <br><br>Point of Sale Type <br>dfgfdg/sdgfdsgdsg (Default) <br><br><br><br>Attachments: </span></span>
                                </p><table class="MsoNormalTable" border="1" cellspacing="0" cellpadding="0" width="600" style="width:6.25in; background:white; border:outset black 1.0pt"><tbody><tr style=""><td style="padding:2.0pt 2.0pt 2.0pt 2.0pt"><p class="MsoNormal"><span style=""><span style="color:black">Attachments </span></span><span style=""><span style=""></span></span></p></td><span style=""></span><td style="padding:2.0pt 2.0pt 2.0pt 2.0pt"><p class="MsoNormal"><span style=""><span style="color:black"></span></span><span style=""><span style=""></span></span></p></td><span style=""></span></tr></tbody></table><p class="MsoNormal"><span style=""><span style="font-size:9.0pt; font-family:" Arial",sans-serif"><br><br>Your  type includes you in the list of members to whom forms of this type are sent. You can opt out of receiving forms of this type via the Forms link on your Profile screen. </span></span></p>
                              </td><span style=""></span>
                            </tr>
                          </tbody>
                        </table><span style=""></span>
                      </td><span style=""></span>
                    </tr><tr style=""><td style="border:none; padding:2.0pt 2.0pt 2.0pt 2.0pt"><div><p class="MsoNormal"><span style=""><i><span style="font-size:7.5pt; color:#666666">This email was sent in response to the use of the platform and website by AWCC. It was generated by: </span></i></span></p><div style="margin-left:11.25pt; margin-top:3.0pt"><p class="MsoNormal"><span style=""><i><span style="font-size:7.5pt; color:#666666">AAXC, LLC <br>43543543 fgfdgfdg <br>AXD, WE 324324 <br>dgfdgfdgfd (457-dsfds) - Outside the US, call +1 45435435435 </span></i></span></p></div></div></td><span style=""></span></tr>
                  </tbody>
                </table>
              </div><span style=""></span>
            </td><span style=""></span>
          </tr>
        </tbody>
      </table><span style=""></span><p class="MsoNormal"><span style=""></span></p>
    </div>
  </body>
</html>

I am using this code to fetch the values, but it returns all of them as a single paragraph.

Document doc = Jsoup.parse(htmlString);
List<String> valueList = new ArrayList<>();
Elements keyElements = doc.getElementsByTag("td");
for (Element keyElement : keyElements) {
    String value = keyElement.text();
    // store in value list
    valueList.add(value);
}

I also tried

doc.getElementsByTag("br");

but this returns empty values.

I want to store each of the values in a map like the one below, but I am not able to separate the values from the HTML because they come back either as one paragraph or empty.

My map:

Key                      Value

Phone                    (718) 3543
Legal Business Name      Asfdsf
DBA City                 XYXXdsfds

... and so on

Can someone please help me to get this data in a better way?

it must be getElementsByTagName . TT

You can use this solution:


Document.OutputSettings outputSettings = new Document.OutputSettings();
outputSettings.prettyPrint(false);
doc.outputSettings(outputSettings);
// Put a literal "\n" placeholder in front of every <br> and <p>
doc.select("br").before("\\n");
doc.select("p").before("\\n");
// Turn the placeholders into real line breaks, then strip all tags
String str = doc.html().replaceAll("\\\\n", "\n");
String strWithNewLines = Jsoup.clean(str, "", Safelist.none(), outputSettings);
System.out.println(strWithNewLines);

I suppose you can try this:

If the HTML String was this:

String html = "<html>\n"
            + "  </head>\n"
            + "<table class=\"MsoNormalTable\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" style=\"\">\n"
            + "                            <tbody>\n"
            + "                              <tr style=\"\">\n"
            + "                                <td>\n"
            + "                                  <p class=\"MsoNormal\">\n"
            + "                                    <span style=\"\">\n"
            + "                                      <span style=\"font-size:9.0pt; font-family:\"Arial\",sans-serif\">\n"
            + "                                        <br>\n"
            + "                                        <br>Information: \n"
            + "                                        <br>\n"
            + "                                        <br>Legal Business Name\n"
            + "                                        <br>Asfdsf\n"
            + "                                        <br>\n"
            + "                                        <br>Phone\n"
            + "                                        <br>(718) 43543\n"
            + "                                        <br>\n"
            + "                                        <br>Principle Name 1\n"
            + "                                        <br>afdsgsfgsg df\n"
            + "                                        <br>\n"
            + "                                        <br>Bus Street Address\n"
            + "                                        <br>sdfdsf\n"
            + "                                        <br>\n"
            + "                                        <br>Bus City\n"
            + "                                        <br>sdfdsf\n"
            + "                                        <br>\n"
            + "                                        <br>Bus State\n"
            + "                                        <br>ny\n"
            + "                                        <br>\n"
            + "                                        <br>Bus Zip Code\n"
            + "                                        <br>4324324\n"
            + "                                        <br>\n"
            + "                                        <br>Email Address\n"
            + "                                        <br>dsfdsfds@xyz.com\n"
            + "                                        <br>\n"
            + "                                        <br>Tertiary Email Address\n"
            + "                                        <br>--- No answer ---\n"
            + "                                        <br>\n"
            + "                                        <br>Business Website Address\n"
            + "                                        <br>dsfdsf.com\n"
            + "                                        <br>\n"
            + "                                        <br>DBA info same as Business\n"
            + "                                        <br>\n"
            + "                                        <br>DBA information is same as Business.\n"
            + "                                        <br>\n"
            + "                                        <br>DBA Name\n"
            + "                                        <br>Awqeewd gdfg\n"
            + "                                        <br>\n"
            + "                                        <br>DBA Street Address\n"
            + "                                        <br>dsfdsf 3432 fdgdf\n"
            + "                                        <br>\n"
            + "                                        <br>DBA City\n"
            + "                                        <br>NORTH\n"
            + "                                        <br>\n"
            + "                                        <br>Attachments:\n"
            + "                                      </span>\n"
            + "                                    </span>\n"
            + "                                  </p>\n"
            + "        <p class=\"MsoNormal\">\n"
            + "          <span style=\"\"> \n"
            + "          </span>\n"
            + "        </p>\n"
            + "      </div>\n"
            + "      </body>\n"
            + "    </html>";

And you run this string through the following method provided below:

String[] values = getTextAfterHtmlStartEndTags(html, "br");

// Display the discovered values...
for (String str : values) {
    System.out.println(str);
}

The console window will display:

Information:

Legal Business Name
Asfdsf

Phone
(718) 43543

Principle Name 1
afdsgsfgsg df

Bus Street Address
sdfdsf

Bus City
sdfdsf

Bus State
ny

Bus Zip Code
4324324

Email Address
dsfdsfds@xyz.com

Tertiary Email Address
--- No answer ---

Business Website Address
dsfdsf.com

DBA info same as Business

DBA information is same as Business.

DBA Name
Awqeewd gdfg

DBA Street Address
dsfdsf 3432 fdgdf

DBA City
NORTH

Attachments:

The getTextAfterHtmlStartEndTags() method:

/**
 *
 * To be used with the JSoup API<br><br>
 * <b>Example Usage:</b><br><pre>
 *
 * <b>Required Imports:</b>
 *
 *  import java.util.ArrayList;
 *  import java.util.List;
 *  import org.jsoup.Jsoup;
 *  import org.jsoup.nodes.Document;
 *  import org.jsoup.nodes.Element;
 *  import org.jsoup.nodes.Node;
 *  import org.jsoup.select.Elements;
 *
 * <b>Example Code:</b>
 *
 * {@code    String html = "<td>\n"
 *           + "    <span class=\"detailh2\" style=\"margin:0px\">This month: </span>2 145 \n"
 *           + "    <span class=\"detailh2\">Total: </span> 31 704                         \n"
 *           + "    <span class=\"detailh2\">Last: </span> 30.12.2021                      \n"
 *           + "</td>";
 *
 *     String[] values = getTextAfterHtmlStartEndTags(html, "span");
 *     for (String str : values) {
 *         System.out.println(str);
 *     }}</pre><br>
 * <p>
 * The console window will display:
 * <pre>
 *
 *      2 145
 *      31 704
 *      30.12.2021</pre><br>
 * <p>
 * If you want the data from a specific HTML tag element then you can supply
 * one or more text elements within those HTML tags in the optional
 * 'specificTo' parameter as a string array or as varargs, for example:
 * <pre>
 *
 *  {@code   String[] values = getTextAfterHtmlStartEndTags(html, "span", "This month:", "Total:");
 *     for (String str : values) {
 *         System.out.println(str);
 *     }}</pre><br>
 * <p>
 * The console window will display:
 * <pre>
 *
 *      This month: --> 2 145
 *      Total: --> 31 704</pre>
 *
 * @param htmlString         (String) The HTML string to parse.<br>
 *
 * @param htmlStartTagString (String) The HTML start tag to get data
 *                           from.<br>
 *
 * @param specificTo         (String - args) The desired data from multiple
 *                           HTML tags of the same type (see the above
 *                           example code).<br>
 *
 * @return (String[] Array) A single Dimensional String Array containing the
 *         desired data (if properly parsed and found).
 */
public static String[] getTextAfterHtmlStartEndTags(String htmlString,
        String htmlStartTagString, String... specificTo) {
    List<String> list = new ArrayList<>();
    Document doc = Jsoup.parse(htmlString);
    Elements elements = doc.select(htmlStartTagString);
    for (Element a : elements) {
        // The wanted text is the node that directly follows the tag.
        Node node = a.nextSibling();
        if (node == null) {
            continue; // the tag is the last node of its parent, nothing follows it
        }
        String text = node.toString().trim();
        if (specificTo.length > 0) {
            // Keep only the tags whose own text contains one of the requested labels.
            for (String label : specificTo) {
                if (a.text().contains(label)) {
                    list.add(label + " --> " + text);
                }
            }
        }
        else {
            list.add(text);
        }
    }
    return list.toArray(new String[0]);
}
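
Since the goal in the question is a map, the same nextSibling() idea can pair the text nodes directly instead of returning a flat array. This is only a rough sketch: it assumes a label is immediately followed by its value, and that an empty or missing text node after a <br> ends the current group. BrPairs and toMap are hypothetical names, not part of Jsoup.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class BrPairs {

    /**
     * Hypothetical helper: pairs the text nodes that follow <br> tags,
     * so "<br>Key <br>Value <br>" becomes key -> value.
     */
    static Map<String, String> toMap(Element container) {
        Map<String, String> result = new LinkedHashMap<>();
        String pendingKey = null;
        Element lastParent = null;
        for (Element br : container.select("br")) {
            // Never pair text across two different parent elements.
            if (br.parent() != lastParent) {
                pendingKey = null;
                lastParent = br.parent();
            }
            Node next = br.nextSibling();
            String text = (next instanceof TextNode)
                    ? ((TextNode) next).text().trim()
                    : "";
            if (text.isEmpty()) {
                pendingKey = null;            // blank line: the group ended
            } else if (pendingKey == null) {
                pendingKey = text;            // first line of a group is the label
            } else {
                result.put(pendingKey, text); // second line is the value
                pendingKey = null;
            }
        }
        return result;
    }
}
```

Calling BrPairs.toMap(doc.body()) scans every <br> in the document. For the email in the question it is probably better to pass only the element that holds the form fields, because the address lines in the footer are also separated by <br> tags and would otherwise be paired up as well.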

You can use the Element.wholeText() method to preserve line separators.

Unfortunately, it also preserves the indentation of the source, so you need to remove the leading spaces or tabs from each line.

Demo:

String htmlString = "..."; // <--- replace with your HTML

Document doc = Jsoup.parse(htmlString);
Elements keyElements = doc.getElementsByTag("td");
for (Element keyElement : keyElements) {
    String value = keyElement
            .wholeText()
            .trim()                        
            .replaceAll("(?m)^[ \t]+",""); //remove leading spaces and tabs from each line
    System.out.println(value);
    System.out.println("---");
}

Output (based on HTML from question):

Information: 

Legal Business Name
Asfdsf

Phone
(718) 43543

Principle Name 1
afdsgsfgsg df

Bus Street Address
sdfdsf

Bus City
sdfdsf

Bus State
ny

Bus Zip Code
4324324

Email Address
dsfdsfds@xyz.com

Tertiary Email Address
--- No answer ---

Business Website Address
dsfdsf.com

DBA info same as Business

DBA information is same as Business.

DBA Name
Awqeewd gdfg

DBA Street Address
dsfdsf 3432 fdgdf

DBA City
NORTH

Attachments:
---
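
Text in this one-field-per-line shape can then be folded into the key/value map the question asks for. A minimal sketch, assuming each field is a two-line block (label, then answer) and that blocks are separated by blank lines. LinePairs and toMap are hypothetical names, and only the standard library is used.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LinePairs {

    /** Hypothetical helper: turns blank-line-separated "label\nvalue" blocks into an ordered map. */
    static Map<String, String> toMap(String text) {
        Map<String, String> result = new LinkedHashMap<>();
        List<String> block = new ArrayList<>();
        for (String raw : text.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                flush(block, result); // a blank line closes the current block
            } else {
                block.add(line);
            }
        }
        flush(block, result);         // the text may not end with a blank line
        return result;
    }

    /** Stores the block if it is exactly a label plus a value, then resets it. */
    private static void flush(List<String> block, Map<String, String> result) {
        if (block.size() == 2) {
            result.put(block.get(0), block.get(1));
        }
        block.clear();
    }
}
```

The string to pass is the value computed inside the loop, before the "---" separator is printed. Blocks with a single line, such as "Information:" or "Attachments:", are skipped. Whether they should instead be stored with an empty value depends on how the map is used.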
