Jsoup to fetch data from HTML between two <br> tags

I am working on a personal project and want to parse this HTML and retrieve information from it.

Basically, I want to get all the information that sits between the <br> tags; for this I am using Jsoup in Java.

I want to store these values as pairs in a map (key, value).

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <style> </style>
  </head>
  <body lang="EN-US" link="#0563C1" vlink="#954F72" style="">
    <div class="WordSection1">
      <p class="MsoNormal">
        <span style=""></span>
      </p>
      <p class="MsoNormal">
        <span style=""></span>
      </p>
      <div>
        <div style="border:none; border-top:solid #E1E1E1 1.0pt; padding:3.0pt 0in 0in 0in">
          <p class="MsoNormal">
            <a name="_MailOriginal">
              <b>
                <span style="">From: </span>
              </b>
            </a>
            <span style="">
              <span style=""> ABC (membership@abc.org)
                  <br>
                  <b>Sent: </b> Tuesday, November 24, 2020 8:13 AM <br>
                  <b>To: </b> XYZ <XYZ@abc.com>
                    <br>
                    <b>Subject: </b> Information Request </span>
            </span>
          </p>
        </div>
      </div>
      <p class="MsoNormal">
        <span style=""></span>
      </p>
      <table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" align="left" width="100%" style="width:100.0%">
        <tbody>
          <tr style="">
            <td style="background:#910A19; padding:5.25pt 1.5pt 5.25pt 1.5pt">
              <span style=""></span>
            </td>
            <span style=""></span>
            <td width="100%">
              <div>
                <p class="MsoNormal" style="">
                  <span style="">
                    <b>
                      <span style="font-size:12.0pt; font-family:" ` Calibri (Body)`",serif; color:#212121">EXTERNAL EMAIL: Beware of Phishing attacks! </span>
                    </b>
                  </span>
                </p>
              </div>
            </td>
            <span style=""></span>
          </tr>
        </tbody>
      </table>
      <table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" width="100%" style="width:100.0%; background:#B2B2B2">
        <tbody>
          <tr style="">
            <td style="padding:25.0pt 25.0pt 25.0pt 25.0pt">
              <div align="center">
                <table class="MsoNormalTable" border="1" cellspacing="0" cellpadding="0" width="600" style="width:6.25in; background:white; border:solid black 1.0pt">
                  <tbody>
                    <tr style="">
                      <td style="border:none; padding:2.0pt 2.0pt 2.0pt 2.0pt">
                        <table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" style="">
                          <tbody>
                            <tr style="">
                              <td style="border:none; border-bottom:solid #CDCDCD 1.0pt; padding:7.5pt 3.75pt 7.5pt 3.75pt">
                                <p class="MsoNormal">
                                  <span style="">
                                    <span style="border:solid windowtext 1.0pt; padding:0in">
                                      <img width="100" height="100" id="_x0000_i1025" src="cid:~WRD2635.jpg" alt="Image removed by sender.">
                                    </span>
                                  </span>
                                  <span style="">
                                    <span style=""></span>
                                  </span>
                                </p>
                              </td>
                              <span style=""></span>
                              <td width="100%" style="width:100.0%; border:none; border-bottom:solid #CDCDCD 1.0pt; padding:7.5pt 3.75pt 7.5pt 3.75pt">
                                <p class="MsoNormal">
                                  <span style="">
                                    <b>
                                      <span style="font-size:18.0pt; font-family:" Arial",sans-serif">AWSCV </span>
                                    </b>
                                  </span>
                                </p>
                              </td>
                              <span style=""></span>
                            </tr>
                          </tbody>
                        </table>
                        <span style=""></span>
                      </td>
                      <span style=""></span>
                    </tr>
                    <tr style="">
                      <td style="border:none; padding:2.0pt 2.0pt 2.0pt 2.0pt">
                        <table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" style="">
                          <tbody>
                            <tr style="">
                              <td style="border:none; border-bottom:solid #CDCDCD 1.0pt; padding:7.5pt 7.5pt 7.5pt 7.5pt">
                                <p class="MsoNormal">
                                  <span style="">
                                    <span style="font-size:9.0pt; font-family:" Arial",sans-serif">Dear XYZ, <br>
                                      <br>The following Information Request form was submitted by ABC, Company: asd, Email: asd@abc.com on 11/23/2020. <br>
                                      <br>Information: <br>
                                      <br>Legal Business Name <br>Asfdsf <br>
                                      <br>Phone <br>(718) 43543 <br>
                                      <br>Principle Name 1 <br>afdsgsfgsg df <br>
                                      <br>EIN <br>04543 <br>
                                      <br>Bus Street Address <br>fdgfdgfdg <br>
                                      <br>Bus City <br>fgfdvgdsgs <br>
                                      <br>Bus State <br>dsf <br>
                                      <br>Bus Zip Code <br>34543534 <br>
                                      <br>Email Address <br>abc@gamil.com <br>
                                      <br>Secondary Email Address <br>abc@gamil.com <br>
                                      <br>Business Website Address <br>NOEMAIL.COM <br>
                                      <br>DBA info same as Business <br>
                                      <br>DBA information is same as Business. <br>
                                      <br>DBA Name <br>Asfdsf <br>
                                      <br>DBA Street Address <br>sgfdgfdg435435 34 <br>
                                      <br>DBA City <br>ACDCROCK <br>
                                      <br>DBA State <br>AT <br>
                                      <br>DBA Zip Code <br>324324 <br>
                                      <br>DBA Phone <br>(458) 43543543 <br>
                                      <br>DBA Email Address <br>abc@gamil.com <br><br>Secondary DBA Email Address <br>--- No answer --- <br><br>Tertiary DBA Email Address <br>--- No answer --- <br><br>DBA Website Address <br>NOEMAIL.COM <br><br>Secondary DBA Website Address <br>--- No answer --- <br><br>Tertiary DBA Website Address <br>--- No answer --- <br><br>Information Request Text <br>Any information would be helpful <br><br> Description <br>ACCESSORIES <br><br>wegf <br>4545 <br><br>Point of Sale Type <br>dfgfdg/sdgfdsgdsg (Default) <br><br><br><br>Attachments: </span></span>
                                </p><table class="MsoNormalTable" border="1" cellspacing="0" cellpadding="0" width="600" style="width:6.25in; background:white; border:outset black 1.0pt"><tbody><tr style=""><td style="padding:2.0pt 2.0pt 2.0pt 2.0pt"><p class="MsoNormal"><span style=""><span style="color:black">Attachments </span></span><span style=""><span style=""></span></span></p></td><span style=""></span><td style="padding:2.0pt 2.0pt 2.0pt 2.0pt"><p class="MsoNormal"><span style=""><span style="color:black"></span></span><span style=""><span style=""></span></span></p></td><span style=""></span></tr></tbody></table><p class="MsoNormal"><span style=""><span style="font-size:9.0pt; font-family:" Arial",sans-serif"><br><br>Your  type includes you in the list of members to whom forms of this type are sent. You can opt out of receiving forms of this type via the Forms link on your Profile screen. </span></span></p>
                              </td><span style=""></span>
                            </tr>
                          </tbody>
                        </table><span style=""></span>
                      </td><span style=""></span>
                    </tr><tr style=""><td style="border:none; padding:2.0pt 2.0pt 2.0pt 2.0pt"><div><p class="MsoNormal"><span style=""><i><span style="font-size:7.5pt; color:#666666">This email was sent in response to the use of the platform and website by AWCC. It was generated by: </span></i></span></p><div style="margin-left:11.25pt; margin-top:3.0pt"><p class="MsoNormal"><span style=""><i><span style="font-size:7.5pt; color:#666666">AAXC, LLC <br>43543543 fgfdgfdg <br>AXD, WE 324324 <br>dgfdgfdgfd (457-dsfds) - Outside the US, call +1 45435435435 </span></i></span></p></div></div></td><span style=""></span></tr>
                  </tbody>
                </table>
              </div><span style=""></span>
            </td><span style=""></span>
          </tr>
        </tbody>
      </table><span style=""></span><p class="MsoNormal"><span style=""></span></p>
    </div>
  </body>
</html>

I am using this code to fetch the data, but it returns all the values as a single paragraph.

Document doc = Jsoup.parse(htmlString);
List<String> valueList = new ArrayList<>();
Elements keyElements = doc.getElementsByTag("td");
for (Element keyElement : keyElements) {
    String value = keyElement.text();
    // store in value list
}

I also tried

doc.getElementsByTag("br");

but this gives an empty value.

I want to store each of the values in a map like this, but I am not able to separate the values from the HTML because they come back either as one paragraph or empty.

My map:

Key                    VALUE

Phone                 (718) 3543

Legal Business Name      Asfdsf

DBA City                XYXXdsfds

... and so on

Can someone please help me get this data in a better way?

It must be getElementsByTagName.

You can use this solution:

Document.OutputSettings outputSettings = new Document.OutputSettings();
outputSettings.prettyPrint(false);
doc.outputSettings(outputSettings);
// Mark every <br> and <p> with a literal backslash-n placeholder
doc.select("br").before("\\n");
doc.select("p").before("\\n");
// Turn the placeholders into real line breaks, then strip all tags
String str = doc.html().replaceAll("\\\\n", "\n");
String strWithNewLines = Jsoup.clean(str, "", Safelist.none(), outputSettings);
System.out.println(strWithNewLines);
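The escaping in that snippet is easy to misread. In Java source, `"\\n"` is the two characters `\` and `n`, not a line break, and `"\\\\n"` is the regular expression that matches exactly those two characters. Below is a minimal sketch of only the string handling, without Jsoup; the class name `PlaceholderDemo`, the method `expandPlaceholders`, and the sample string are made up for illustration:

```java
class PlaceholderDemo {
    // Replaces every literal backslash-n placeholder with a real line break.
    static String expandPlaceholders(String text) {
        // "\\\\n" in Java source is the regex \\n: an escaped backslash followed by 'n'.
        return text.replaceAll("\\\\n", "\n");
    }

    public static void main(String[] args) {
        // "\\n" here is a placeholder made of two characters, not a newline.
        String marked = "Phone\\n(718) 43543";
        System.out.println(expandPlaceholders(marked));
    }
}
```

Text that already contains real line breaks passes through unchanged, because the pattern only matches a backslash followed by `n`.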

I suppose you can try this:

If the HTML string was this:

String html = "<html>\n"
            + "  </head>\n"
            + "<table class=\"MsoNormalTable\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" style=\"\">\n"
            + "                            <tbody>\n"
            + "                              <tr style=\"\">\n"
            + "                                <td>\n"
            + "                                  <p class=\"MsoNormal\">\n"
            + "                                    <span style=\"\">\n"
            + "                                      <span style=\"font-size:9.0pt; font-family:\"Arial\",sans-serif\">\n"
            + "                                        <br>\n"
            + "                                        <br>Information: \n"
            + "                                        <br>\n"
            + "                                        <br>Legal Business Name\n"
            + "                                        <br>Asfdsf\n"
            + "                                        <br>\n"
            + "                                        <br>Phone\n"
            + "                                        <br>(718) 43543\n"
            + "                                        <br>\n"
            + "                                        <br>Principle Name 1\n"
            + "                                        <br>afdsgsfgsg df\n"
            + "                                        <br>\n"
            + "                                        <br>Bus Street Address\n"
            + "                                        <br>sdfdsf\n"
            + "                                        <br>\n"
            + "                                        <br>Bus City\n"
            + "                                        <br>sdfdsf\n"
            + "                                        <br>\n"
            + "                                        <br>Bus State\n"
            + "                                        <br>ny\n"
            + "                                        <br>\n"
            + "                                        <br>Bus Zip Code\n"
            + "                                        <br>4324324\n"
            + "                                        <br>\n"
            + "                                        <br>Email Address\n"
            + "                                        <br>dsfdsfds@xyz.com\n"
            + "                                        <br>\n"
            + "                                        <br>Tertiary Email Address\n"
            + "                                        <br>--- No answer ---\n"
            + "                                        <br>\n"
            + "                                        <br>Business Website Address\n"
            + "                                        <br>dsfdsf.com\n"
            + "                                        <br>\n"
            + "                                        <br>DBA info same as Business\n"
            + "                                        <br>\n"
            + "                                        <br>DBA information is same as Business.\n"
            + "                                        <br>\n"
            + "                                        <br>DBA Name\n"
            + "                                        <br>Awqeewd gdfg\n"
            + "                                        <br>\n"
            + "                                        <br>DBA Street Address\n"
            + "                                        <br>dsfdsf 3432 fdgdf\n"
            + "                                        <br>\n"
            + "                                        <br>DBA City\n"
            + "                                        <br>NORTH\n"
            + "                                        <br>\n"
            + "                                        <br>Attachments:\n"
            + "                                      </span>\n"
            + "                                    </span>\n"
            + "                                  </p>\n"
            + "        <p class=\"MsoNormal\">\n"
            + "          <span style=\"\"> \n"
            + "          </span>\n"
            + "        </p>\n"
            + "      </div>\n"
            + "      </body>\n"
            + "    </html>";

And you run this string through the method provided below:

String[] values = getTextAfterHtmlStartEndTags(html, "br");

// Display the discovered values...
for (String str : values) {
    System.out.println(str);
}

The console window will display:

Information:

Legal Business Name
Asfdsf

Phone
(718) 43543

Principle Name 1
afdsgsfgsg df

Bus Street Address
sdfdsf

Bus City
sdfdsf

Bus State
ny

Bus Zip Code
4324324

Email Address
dsfdsfds@xyz.com

Tertiary Email Address
--- No answer ---

Business Website Address
dsfdsf.com

DBA info same as Business

DBA information is same as Business.

DBA Name
Awqeewd gdfg

DBA Street Address
dsfdsf 3432 fdgdf

DBA City
NORTH

Attachments:
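To get from output like this to the (key, value) map the question asks for, one option is to treat blank entries as separators and pair each label with the line that follows it. This is a rough sketch in plain Java rather than part of the answer's method; the class `BrPairs` and the method `toMap` are hypothetical names, and it assumes every answered field is exactly one label line followed by one value line, so label-only groups such as `Information:` are skipped:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BrPairs {
    // Groups consecutive non-blank lines; a group of exactly two lines
    // is treated as a (label, value) pair.
    static Map<String, String> toMap(String[] lines) {
        Map<String, String> result = new LinkedHashMap<>(); // keeps field order
        List<String> group = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                addGroup(group, result);
                group.clear();
            } else {
                group.add(line);
            }
        }
        addGroup(group, result); // the last group has no trailing blank line
        return result;
    }

    // Stores the group only when it has the assumed label/value shape.
    private static void addGroup(List<String> group, Map<String, String> result) {
        if (group.size() == 2) {
            result.put(group.get(0), group.get(1));
        }
    }

    public static void main(String[] args) {
        String[] values = {
            "Information:", "",
            "Legal Business Name", "Asfdsf", "",
            "Phone", "(718) 43543", "",
            "Attachments:"
        };
        toMap(values).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The array returned by `getTextAfterHtmlStartEndTags(html, "br")` could be passed to `toMap` directly. A `LinkedHashMap` is used so the fields stay in the order they appear in the email.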

The getTextAfterHtmlStartEndTags() method:

/**
 *
 * To be used with the JSoup API<br><br>
 * <b>Example Usage:</b><br><pre>
 *
 * <b>Required Imports:</b>
 *
 *  import java.util.ArrayList;
 *  import java.util.List;
 *  import org.jsoup.Jsoup;
 *  import org.jsoup.nodes.Document;
 *  import org.jsoup.nodes.Element;
 *  import org.jsoup.nodes.Node;
 *  import org.jsoup.select.Elements;
 *
 * <b>Example Code:</b>
 *
 * {@code    String html = "<td>\n"
 *           + "    <span class=\"detailh2\" style=\"margin:0px\">This month: </span>2 145 \n"
 *           + "    <span class=\"detailh2\">Total: </span> 31 704                         \n"
 *           + "    <span class=\"detailh2\">Last: </span> 30.12.2021                      \n"
 *           + "</td>";
 *
 *     String[] values = getTextAfterHtmlStartEndTags(html, "span");
 *     for (String str : values) {
 *         System.out.println(str);
 *     }}</pre><br>
 * <p>
 * The console window will display:
 * <pre>
 *
 *      2 145
 *      31 704
 *      30.12.2021</pre><br>
 * <p>
 * If you want the data from a specific HTML tag element then you can supply
 * one or more text elements within those HTML tags in the optional
 * 'specificTo' parameter as a string array or as args, for example:
 * <pre>
 *
 *  {@code   String[] values = getTextAfterHtmlStartEndTags(html, "span", "This month:", "Total:");
 *     for (String str : values) {
 *         System.out.println(str);
 *     }}</pre><br>
 * <p>
 * The console window will display:
 * <pre>
 *
 *      This month: --> 2 145
 *      Total: --> 31 704</pre>
 *
 * @param htmlString         (String) The HTML string to parse.<br>
 *
 * @param htmlStartTagString (String) The HTML start tag to get data
 *                           from.<br>
 *
 * @param specificTo         (String - args) The desired data from multiple
 *                           HTML tags of the same type (see the above
 *                           example code).<br>
 *
 * @return (String[] Array) A single Dimensional String Array containing the
 *         desired data (if properly parsed and found).
 */
public static String[] getTextAfterHtmlStartEndTags(String htmlString,
        String htmlStartTagString, String... specificTo) {
    List<String> list = new ArrayList<>();
    Document doc = Jsoup.parse(htmlString);
    Elements elements = doc.select(htmlStartTagString);
    for (Element a : elements) {
        // The text that follows the tag is its next sibling node. Skip the
        // element when there is none (for example a trailing <br>), which
        // would otherwise cause a NullPointerException.
        Node node = a.nextSibling();
        if (node == null) {
            continue;
        }
        String text = node.toString().trim();
        if (specificTo.length > 0) {
            // Keep only the tags whose own text contains a requested label.
            for (String label : specificTo) {
                if (a.text().contains(label)) {
                    list.add(label + " --> " + text);
                }
            }
        }
        else {
            list.add(text);
        }
    }
    return list.toArray(new String[0]);
}

You can use the Element.wholeText() method to preserve line separators.

Unfortunately, it looks like it also preserves the indentation depth, so you need to remove the leading spaces or tabs from each line.

Demo:

String htmlString = "..."; // <--- replace with your HTML

Document doc = Jsoup.parse(htmlString);
Elements keyElements = doc.getElementsByTag("td");
for (Element keyElement : keyElements) {
    String value = keyElement
            .wholeText()
            .trim()                        
            .replaceAll("(?m)^[ \t]+",""); //remove leading spaces and tabs from each line
    System.out.println(value);
    System.out.println("---");
}

Output (based on HTML from question):

Information: 

Legal Business Name
Asfdsf

Phone
(718) 43543

Principle Name 1
afdsgsfgsg df

Bus Street Address
sdfdsf

Bus City
sdfdsf

Bus State
ny

Bus Zip Code
4324324

Email Address
dsfdsfds@xyz.com

Tertiary Email Address
--- No answer ---

Business Website Address
dsfdsf.com

DBA info same as Business

DBA information is same as Business.

DBA Name
Awqeewd gdfg

DBA Street Address
dsfdsf 3432 fdgdf

DBA City
NORTH

Attachments:
---
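The `(?m)` prefix in the demo's regular expression matters: without it, `^` matches only at the very start of the string, so only the first line would lose its indentation. Here is a small sketch in plain Java, without Jsoup, using a made-up sample string; the class name `IndentDemo` and the method `stripIndent` are hypothetical:

```java
class IndentDemo {
    // Strips leading spaces and tabs from every line of the text.
    static String stripIndent(String text) {
        // (?m) turns on MULTILINE mode, so ^ also matches after each line break.
        return text.replaceAll("(?m)^[ \t]+", "");
    }

    public static void main(String[] args) {
        String indented = "   Phone\n      (718) 43543\n\tDBA City";
        System.out.println(stripIndent(indented));
    }
}
```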
