I am working on a personal project and want to parse this HTML and retrieve information from it.
Basically, I want to get all the text that sits between the 'br' tags. I am using jsoup in Java for this.
I want to store these values as key/value pairs in a map.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style> </style>
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72" style="">
<div class="WordSection1">
<p class="MsoNormal">
<span style=""></span>
</p>
<p class="MsoNormal">
<span style=""></span>
</p>
<div>
<div style="border:none; border-top:solid #E1E1E1 1.0pt; padding:3.0pt 0in 0in 0in">
<p class="MsoNormal">
<a name="_MailOriginal">
<b>
<span style="">From: </span>
</b>
</a>
<span style="">
<span style=""> ABC (membership@abc.org)
<br>
<b>Sent: </b> Tuesday, November 24, 2020 8:13 AM <br>
<b>To: </b> XYZ <XYZ@abc.com>
<br>
<b>Subject: </b> Information Request </span>
</span>
</p>
</div>
</div>
<p class="MsoNormal">
<span style=""></span>
</p>
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" align="left" width="100%" style="width:100.0%">
<tbody>
<tr style="">
<td style="background:#910A19; padding:5.25pt 1.5pt 5.25pt 1.5pt">
<span style=""></span>
</td>
<span style=""></span>
<td width="100%">
<div>
<p class="MsoNormal" style="">
<span style="">
<b>
<span style="font-size:12.0pt; font-family:" ` Calibri (Body)`",serif; color:#212121">EXTERNAL EMAIL: Beware of Phishing attacks! </span>
</b>
</span>
</p>
</div>
</td>
<span style=""></span>
</tr>
</tbody>
</table>
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" width="100%" style="width:100.0%; background:#B2B2B2">
<tbody>
<tr style="">
<td style="padding:25.0pt 25.0pt 25.0pt 25.0pt">
<div align="center">
<table class="MsoNormalTable" border="1" cellspacing="0" cellpadding="0" width="600" style="width:6.25in; background:white; border:solid black 1.0pt">
<tbody>
<tr style="">
<td style="border:none; padding:2.0pt 2.0pt 2.0pt 2.0pt">
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" style="">
<tbody>
<tr style="">
<td style="border:none; border-bottom:solid #CDCDCD 1.0pt; padding:7.5pt 3.75pt 7.5pt 3.75pt">
<p class="MsoNormal">
<span style="">
<span style="border:solid windowtext 1.0pt; padding:0in">
<img width="100" height="100" id="_x0000_i1025" src="cid:~WRD2635.jpg" alt="Image removed by sender.">
</span>
</span>
<span style="">
<span style=""></span>
</span>
</p>
</td>
<span style=""></span>
<td width="100%" style="width:100.0%; border:none; border-bottom:solid #CDCDCD 1.0pt; padding:7.5pt 3.75pt 7.5pt 3.75pt">
<p class="MsoNormal">
<span style="">
<b>
<span style="font-size:18.0pt; font-family:" Arial",sans-serif">AWSCV </span>
</b>
</span>
</p>
</td>
<span style=""></span>
</tr>
</tbody>
</table>
<span style=""></span>
</td>
<span style=""></span>
</tr>
<tr style="">
<td style="border:none; padding:2.0pt 2.0pt 2.0pt 2.0pt">
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" style="">
<tbody>
<tr style="">
<td style="border:none; border-bottom:solid #CDCDCD 1.0pt; padding:7.5pt 7.5pt 7.5pt 7.5pt">
<p class="MsoNormal">
<span style="">
<span style="font-size:9.0pt; font-family:" Arial",sans-serif">Dear XYZ, <br>
<br>The following Information Request form was submitted by ABC, Company: asd, Email: asd@abc.com on 11/23/2020. <br>
<br>Information: <br>
<br>Legal Business Name <br>Asfdsf <br>
<br>Phone <br>(718) 43543 <br>
<br>Principle Name 1 <br>afdsgsfgsg df <br>
<br>EIN <br>04543 <br>
<br>Bus Street Address <br>fdgfdgfdg <br>
<br>Bus City <br>fgfdvgdsgs <br>
<br>Bus State <br>dsf <br>
<br>Bus Zip Code <br>34543534 <br>
<br>Email Address <br>abc@gamil.com <br>
<br>Secondary Email Address <br>abc@gamil.com <br>
<br>Business Website Address <br>NOEMAIL.COM <br>
<br>DBA info same as Business <br>
<br>DBA information is same as Business. <br>
<br>DBA Name <br>Asfdsf <br>
<br>DBA Street Address <br>sgfdgfdg435435 34 <br>
<br>DBA City <br>ACDCROCK <br>
<br>DBA State <br>AT <br>
<br>DBA Zip Code <br>324324 <br>
<br>DBA Phone <br>(458) 43543543 <br>
<br>DBA Email Address <br>abc@gamil.com <br><br>Secondary DBA Email Address <br>--- No answer --- <br><br>Tertiary DBA Email Address <br>--- No answer --- <br><br>DBA Website Address <br>NOEMAIL.COM <br><br>Secondary DBA Website Address <br>--- No answer --- <br><br>Tertiary DBA Website Address <br>--- No answer --- <br><br>Information Request Text <br>Any information would be helpful <br><br> Description <br>ACCESSORIES <br><br>wegf <br>4545 <br><br>Point of Sale Type <br>dfgfdg/sdgfdsgdsg (Default) <br><br><br><br>Attachments: </span></span>
</p><table class="MsoNormalTable" border="1" cellspacing="0" cellpadding="0" width="600" style="width:6.25in; background:white; border:outset black 1.0pt"><tbody><tr style=""><td style="padding:2.0pt 2.0pt 2.0pt 2.0pt"><p class="MsoNormal"><span style=""><span style="color:black">Attachments </span></span><span style=""><span style=""></span></span></p></td><span style=""></span><td style="padding:2.0pt 2.0pt 2.0pt 2.0pt"><p class="MsoNormal"><span style=""><span style="color:black"></span></span><span style=""><span style=""></span></span></p></td><span style=""></span></tr></tbody></table><p class="MsoNormal"><span style=""><span style="font-size:9.0pt; font-family:" Arial",sans-serif"><br><br>Your type includes you in the list of members to whom forms of this type are sent. You can opt out of receiving forms of this type via the Forms link on your Profile screen. </span></span></p>
</td><span style=""></span>
</tr>
</tbody>
</table><span style=""></span>
</td><span style=""></span>
</tr><tr style=""><td style="border:none; padding:2.0pt 2.0pt 2.0pt 2.0pt"><div><p class="MsoNormal"><span style=""><i><span style="font-size:7.5pt; color:#666666">This email was sent in response to the use of the platform and website by AWCC. It was generated by: </span></i></span></p><div style="margin-left:11.25pt; margin-top:3.0pt"><p class="MsoNormal"><span style=""><i><span style="font-size:7.5pt; color:#666666">AAXC, LLC <br>43543543 fgfdgfdg <br>AXD, WE 324324 <br>dgfdgfdgfd (457-dsfds) - Outside the US, call +1 45435435435 </span></i></span></p></div></div></td><span style=""></span></tr>
</tbody>
</table>
</div><span style=""></span>
</td><span style=""></span>
</tr>
</tbody>
</table><span style=""></span><p class="MsoNormal"><span style=""></span></p>
</div>
</body>
</html>
I am using this code to fetch the values, but it returns all of them as a single paragraph.
Document doc = Jsoup.parse(htmlString);
List<String> valueList = new ArrayList<>();
Elements keyElements = doc.getElementsByTag("td");
for (Element keyElement : keyElements) {
String value = keyElement.text();
// store in value list
}
I also tried
doc.getElementsByTag("br");
but this returns empty values.
I want to store each value in a map like the one below, but I am not able to separate the values from the HTML because they come back either as one paragraph or empty.
My Map..
Key VALUE
Phone (718) 3543
Legal Business Name Asfdsf
DBA City XYXXdsfds
... and so on
Can someone please help me to get this data in a better way?
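Comment: getElementsByTag is the correct method name in jsoup (getElementsByTagName belongs to the W3C DOM API, not to jsoup). A br element has no content of its own, so calling text() on it always returns an empty string; the values live in the text nodes that follow each br.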
You can use this solution:
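// Keep the original whitespace instead of letting jsoup re-indent the output.
Document.OutputSettings outputSettings = new Document.OutputSettings();
outputSettings.prettyPrint(false);
doc.outputSettings(outputSettings);
// Insert a literal "\n" marker (backslash followed by n) before every <br> and <p>.
doc.select("br").before("\\n");
doc.select("p").before("\\n");
// Turn the markers into real line breaks, then strip all tags.
// Safelist replaces the older Whitelist class in recent jsoup releases.
String str = doc.html().replaceAll("\\\\n", "\n");
String strWithNewLines = Jsoup.clean(str, "", Safelist.none(), outputSettings);
System.out.println(strWithNewLines);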
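The question asks for a map rather than plain text, so here is a minimal sketch of one way to get from newline-separated text to key/value pairs. It assumes each field appears as a block of lines separated from the next block by at least one blank line, with the label on the first line and the value on the following line(s). That matches the double br layout of the sample, but it is an assumption about the input, not something jsoup guarantees. The class name BlockPairs and the method toPairs are hypothetical names chosen for illustration. Blocks with a single line (headings such as "Information:") are skipped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BlockPairs {

    /**
     * Groups non-blank lines into blocks separated by blank lines and maps
     * the first line of each block to the remaining lines of that block.
     * Blocks that contain only one line (no value) are skipped.
     */
    public static Map<String, String> toPairs(String text) {
        Map<String, String> pairs = new LinkedHashMap<>(); // keeps document order
        List<String> block = new ArrayList<>();
        // The appended line breaks guarantee that the last block is flushed.
        for (String rawLine : (text + "\n\n").split("\\R", -1)) {
            String line = rawLine.trim();
            if (!line.isEmpty()) {
                block.add(line);
                continue;
            }
            // A blank line closes the current block.
            if (block.size() >= 2) {
                String value = String.join(" ", block.subList(1, block.size()));
                pairs.put(block.get(0), value);
            }
            block.clear();
        }
        return pairs;
    }

    public static void main(String[] args) {
        String text = "Information: \n\n\nLegal Business Name \nAsfdsf \n\n\nPhone \n(718) 43543 \n";
        System.out.println(toPairs(text));
    }
}
```

A LinkedHashMap is used so the entries come out in the order they appear in the email. If the same label can occur twice, a later value overwrites the earlier one in this sketch.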
You could try the following.
Suppose the HTML string is this:
String html = "<html>\n"
+ " </head>\n"
+ "<table class=\"MsoNormalTable\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" style=\"\">\n"
+ " <tbody>\n"
+ " <tr style=\"\">\n"
+ " <td>\n"
+ " <p class=\"MsoNormal\">\n"
+ " <span style=\"\">\n"
+ " <span style=\"font-size:9.0pt; font-family:\"Arial\",sans-serif\">\n"
+ " <br>\n"
+ " <br>Information: \n"
+ " <br>\n"
+ " <br>Legal Business Name\n"
+ " <br>Asfdsf\n"
+ " <br>\n"
+ " <br>Phone\n"
+ " <br>(718) 43543\n"
+ " <br>\n"
+ " <br>Principle Name 1\n"
+ " <br>afdsgsfgsg df\n"
+ " <br>\n"
+ " <br>Bus Street Address\n"
+ " <br>sdfdsf\n"
+ " <br>\n"
+ " <br>Bus City\n"
+ " <br>sdfdsf\n"
+ " <br>\n"
+ " <br>Bus State\n"
+ " <br>ny\n"
+ " <br>\n"
+ " <br>Bus Zip Code\n"
+ " <br>4324324\n"
+ " <br>\n"
+ " <br>Email Address\n"
+ " <br>dsfdsfds@xyz.com\n"
+ " <br>\n"
+ " <br>Tertiary Email Address\n"
+ " <br>--- No answer ---\n"
+ " <br>\n"
+ " <br>Business Website Address\n"
+ " <br>dsfdsf.com\n"
+ " <br>\n"
+ " <br>DBA info same as Business\n"
+ " <br>\n"
+ " <br>DBA information is same as Business.\n"
+ " <br>\n"
+ " <br>DBA Name\n"
+ " <br>Awqeewd gdfg\n"
+ " <br>\n"
+ " <br>DBA Street Address\n"
+ " <br>dsfdsf 3432 fdgdf\n"
+ " <br>\n"
+ " <br>DBA City\n"
+ " <br>NORTH\n"
+ " <br>\n"
+ " <br>Attachments:\n"
+ " </span>\n"
+ " </span>\n"
+ " </p>\n"
+ " <p class=\"MsoNormal\">\n"
+ " <span style=\"\"> \n"
+ " </span>\n"
+ " </p>\n"
+ " </div>\n"
+ " </body>\n"
+ " </html>";
Then pass this string to the method provided further below:
String[] values = getTextAfterHtmlStartEndTags(html, "br");
// Display the discovered values...
for (String str : values) {
System.out.println(str);
}
The console window will display:
Information:
Legal Business Name
Asfdsf
Phone
(718) 43543
Principle Name 1
afdsgsfgsg df
Bus Street Address
sdfdsf
Bus City
sdfdsf
Bus State
ny
Bus Zip Code
4324324
Email Address
dsfdsfds@xyz.com
Tertiary Email Address
--- No answer ---
Business Website Address
dsfdsf.com
DBA info same as Business
DBA information is same as Business.
DBA Name
Awqeewd gdfg
DBA Street Address
dsfdsf 3432 fdgdf
DBA City
NORTH
Attachments:
The getTextAfterHtmlStartEndTags() method:
/**
*
* To be used with the JSoup API<br><br>
* <b>Example Usage:</b><br><pre>
*
* <b>Required Imports:</b>
*
* import org.jsoup.Jsoup;
* import org.jsoup.nodes.Document;
* import org.jsoup.nodes.Element;
* import org.jsoup.nodes.Node;
* import org.jsoup.select.Elements;
*
* <b>Example Code:</b>
*
* {@code String html = "<td>\n"
* + " <span class=\"detailh2\" style=\"margin:0px\">This month: </span>2 145 \n"
* + " <span class=\"detailh2\">Total: </span> 31 704 \n"
* + " <span class=\"detailh2\">Last: </span> 30.12.2021 \n"
* + "</td>";
*
* String[] values = getTextAfterHtmlStartEndTags(html, "span");
* for (String str : values) {
* System.out.println(str);
* }}</pre><br>
* <p>
* The console window will display:
* <pre>
*
* 2 145
* 31 704
* 30.12.2021</pre><br>
* <p>
* If you want the data from specific HTML tag elements only, you can supply
* one or more of the text values found within those HTML tags in the optional
* 'specificTo' parameter as a string array or as args, for example:
* <pre>
*
* {@code String[] values = getTextAfterHtmlStartEndTags(html, "span", "This month:", "Total:");
* for (String str : values) {
* System.out.println(str);
* }}</pre><br>
* <p>
* The console window will display:
* <pre>
*
* This month: --> 2 145
* Total: --> 31 704</pre>
*
* @param htmlString (String) The HTML string to parse.<br>
*
* @param htmlStartTagString (String) The HTML start tag to get data
* from.<br>
*
* @param specificTo (String - args) The desired data from multiple
* HTML tags of the same type (see the above
* example code).<br>
*
* @return (String[] Array) A single Dimensional String Array containing the
* desired data (if properly parsed and found).
*/
public static String[] getTextAfterHtmlStartEndTags(String htmlString,
        String htmlStartTagString, String... specificTo) {
    List<String> list = new ArrayList<>();
    Document doc = Jsoup.parse(htmlString);
    Elements elements = doc.select(htmlStartTagString);
    for (Element a : elements) {
        // Only a text node directly after the tag carries a value. Skip a
        // missing sibling, a nested element (e.g. a second <br>) and blank text.
        Node node = a.nextSibling();
        if (!(node instanceof org.jsoup.nodes.TextNode)) {
            continue;
        }
        String text = ((org.jsoup.nodes.TextNode) node).text().trim();
        if (text.isEmpty()) {
            continue;
        }
        if (specificTo.length > 0) {
            for (String label : specificTo) {
                // Match on the tag's own text; calling before(...) here
                // would insert HTML into the document as a side effect.
                if (a.text().contains(label)) {
                    list.add(label + " --> " + text);
                }
            }
        }
        else {
            list.add(text);
        }
    }
    return list.toArray(new String[0]);
}
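The method above returns a flat array, while the question asks for a map. Pairing entries purely by position is fragile because headings such as "Information:" have no value. A minimal sketch of one alternative follows, assuming the caller already knows the set of field labels used in the form. The class name LabelPairs, the method pairByLabels and the label set in main are hypothetical and only serve as an illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class LabelPairs {

    /**
     * Walks a flat array of extracted strings. Whenever an entry is one of
     * the known labels, the next entry becomes its value, unless that entry
     * is missing, blank, or itself a label.
     */
    public static Map<String, String> pairByLabels(String[] values, Set<String> labels) {
        Map<String, String> pairs = new LinkedHashMap<>(); // keeps document order
        for (int i = 0; i < values.length; i++) {
            String key = values[i].trim();
            if (!labels.contains(key)) {
                continue; // headings and stray text are ignored
            }
            String next = (i + 1 < values.length) ? values[i + 1].trim() : "";
            if (next.isEmpty() || labels.contains(next)) {
                pairs.put(key, ""); // label without a value
            } else {
                pairs.put(key, next);
                i++; // the value has been consumed
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String[] values = {"Information:", "Legal Business Name", "Asfdsf", "Phone", "(718) 43543"};
        Set<String> labels = new HashSet<>(Arrays.asList("Legal Business Name", "Phone", "Bus City"));
        System.out.println(pairByLabels(values, labels));
    }
}
```

The trade-off is that new form fields are silently dropped until their labels are added to the set, so this approach suits forms whose layout is stable.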
You can use the Element.wholeText() method to preserve line separators.
Unfortunately, it also appears to preserve the indentation of the source, so you need to remove the leading spaces or tabs from each line.
Demo:
String htmlString = "..."; // <--- replace with your HTML
Document doc = Jsoup.parse(htmlString);
Elements keyElements = doc.getElementsByTag("td");
for (Element keyElement : keyElements) {
String value = keyElement
.wholeText()
.trim()
.replaceAll("(?m)^[ \t]+",""); //remove leading spaces and tabs from each line
System.out.println(value);
System.out.println("---");
}
Output (based on HTML from question):
Information:
Legal Business Name
Asfdsf
Phone
(718) 43543
Principle Name 1
afdsgsfgsg df
Bus Street Address
sdfdsf
Bus City
sdfdsf
Bus State
ny
Bus Zip Code
4324324
Email Address
dsfdsfds@xyz.com
Tertiary Email Address
--- No answer ---
Business Website Address
dsfdsf.com
DBA info same as Business
DBA information is same as Business.
DBA Name
Awqeewd gdfg
DBA Street Address
dsfdsf 3432 fdgdf
DBA City
NORTH
Attachments:
---
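The cleanup step in the demo above relies on the (?m) flag, which makes ^ match at the start of every line instead of only at the start of the whole string. Here is a small standalone sketch of just that step, without jsoup. The class name StripIndent and the sample string are made up for illustration.

```java
public class StripIndent {

    /** Removes leading spaces and tabs from every line of the given text. */
    public static String stripLeadingWhitespace(String text) {
        // (?m) enables multiline mode, so ^ anchors at each line start.
        return text.replaceAll("(?m)^[ \\t]+", "");
    }

    public static void main(String[] args) {
        String indented = "    Phone\n\t\t(718) 43543\n  Bus City";
        System.out.println(stripLeadingWhitespace(indented));
    }
}
```

Without the (?m) flag only the indentation of the first line would be removed, because ^ would match a single position at the beginning of the input.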