
Extract HTML data using Jsoup

I have a table with ID, TEXT, and other columns. TEXT is a CLOB column that contains data in HTML format.

Sample data:

<P class=00Normal style="MARGIN: 0in 0in 0pt 24.3pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">Start: 8:30 am<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /><o:p></o:p></SPAN></P>
<P class=00Normal style="MARGIN: 0in 0in 0pt 24.3pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">End: 4 pm<o:p></o:p></SPAN></P>
<P class=00Normal style="MARGIN: 0in 0in 0pt 24.3pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals.<SPAN style="mso-spacerun: yes">  </SPAN>A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below.<SPAN style="mso-spacerun: yes">  </SPAN>The following items represent the scope and visit focus areas:<o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">1.<SPAN style="FONT: 7pt 'Times New Roman'">       </SPAN></SPAN><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">SOP Program<o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 0.9in; TEXT-INDENT: -22.5pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold"> <o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">2.<SPAN style="FONT: 7pt 'Times New Roman'">       </SPAN></SPAN><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">Training Program<o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 0.9in; TEXT-INDENT: -22.5pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold"> <o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">3.<SPAN style="FONT: 7pt 'Times New Roman'">       </SPAN></SPAN><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">Calibration/Preventive Maintenance Program<o:p></o:p></SPAN></P>

I'm using a Java transformation in Informatica with the Jsoup JAR file imported. When I use Jsoup.parse(AUDIT_SCOPE_LOB).toString(); I get data like this:

<html>
 <head></head>
 <body>
  <p class="00Normal" style="MARGIN: 0in 0in 0pt 24.3pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">Start: 8:30 am
    <!--?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /-->
    <o:p></o:p></span></p> 
  <p class="00Normal" style="MARGIN: 0in 0in 0pt 24.3pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">End: 4 pm
    <o:p></o:p></span></p> 
  <p class="00Normal" style="MARGIN: 0in 0in 0pt 24.3pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals.<span style="mso-spacerun: yes"> </span>A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below.<span style="mso-spacerun: yes"> </span>The following items represent the scope and visit focus areas:
    <o:p></o:p></span></p> 
  <p class="MsoNormal" style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">1.<span style="FONT: 7pt 'Times New Roman'"> </span></span><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">SOP Program
    <o:p></o:p></span></p> 
  <p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.9in; TEXT-INDENT: -22.5pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold"> 
    <o:p></o:p></span></p> 
  <p class="MsoNormal" style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">2.<span style="FONT: 7pt 'Times New Roman'"> </span></span><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">Training Program
    <o:p></o:p></span></p> 
  <p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.9in; TEXT-INDENT: -22.5pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold"> 
    <o:p></o:p></span></p> 
  <p class="MsoNormal" style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">3.<span style="FONT: 7pt 'Times New Roman'"> </span></span><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">Calibration/Preventive Maintenance Program
    <o:p></o:p></span></p> 
 </body>
</html>

When I use Jsoup.parse(AUDIT_SCOPE_LOB).text(); I get data like this:

Start: 8:30 am End: 4 pm The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals. A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below. The following items represent the scope and visit focus areas: 1. SOP Program 2. Training Program 3. Calibration/Preventive Maintenance Program

I don't know much about Java. Can I get Java code that extracts the data using Jsoup and returns output like this:

Start: 8:30 am
End: 4 pm
The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals. A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below. The following items represent the scope and visit focus areas:
1. SOP Program
2. Training Program
3. Calibration/Preventive Maintenance Program

Note that this is only sample data. My real data also contains HTML tags that do not appear here.

Since the information is divided between <p> tags, you have to select all of these tags and print their text one by one. This assumes that AUDIT_SCOPE_LOB is a valid Java String:

Document doc = Jsoup.parse(AUDIT_SCOPE_LOB);
Elements el = doc.select("p");
for (Element e : el) {
    System.out.println(e.text());
}

org.jsoup.nodes.Element.toString() returns org.jsoup.nodes.Element.outerHtml()

Get the outer HTML of this node.


org.jsoup.nodes.Element.text()

Gets the combined text of this element and all its children. Whitespace is normalized and trimmed.


So invoking toString() on the parsed document returns the whole sample as markup, which is the first output you saw. Likewise, invoking text() on it returns all the text without the markup, as a single String. What you want, however, is an individual String for each paragraph of text.
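The difference between the two calls can be seen on a much smaller input. This is only a sketch: the class name TextVersusHtml, its two helper methods, and the two-paragraph input are made up for illustration and are not part of your transformation.

```java
import org.jsoup.Jsoup;

class TextVersusHtml {
    /** Text of the whole document, markup removed, as a single string. */
    static String allText(String html) {
        return Jsoup.parse(html).text();
    }

    /** Serialized markup of everything inside the generated body element. */
    static String bodyHtml(String html) {
        return Jsoup.parse(html).body().html();
    }

    public static void main(String[] args) {
        // Hypothetical two-paragraph input, much shorter than the real CLOB.
        String html = "<p>Start: 8:30 am</p><p>End: 4 pm</p>";
        System.out.println(allText(html));  // paragraphs run together on one line
        System.out.println(bodyHtml(html)); // the <p> tags are still present
    }
}
```

Neither call keeps the paragraph boundaries as line breaks, which is why the paragraphs have to be selected one at a time.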


Some of your paragraph tags are empty. To get the output in your example, check that each paragraph has text before printing it.

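// Parse the String directly. The second argument of Jsoup.parse(String, String)
// is a base URI, not a character set, so "UTF-8" does not belong here.
Document doc = Jsoup.parse(AUDIT_SCOPE_LOB);

for (Element p : doc.select("p")) {
    // Skip paragraphs that contain no text
    if (p.hasText()) {
        System.out.println(p.text());
    }
}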

Output

Start: 8:30 am
End: 4 pm
The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals. A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below. The following items represent the scope and visit focus areas:
1. SOP Program
2. Training Program
3. Calibration/Preventive Maintenance Program
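In an Informatica Java transformation you will probably want the result as one String to assign to an output port rather than lines printed to the console. A minimal sketch of that idea follows. The class name ParagraphLines and the method extractLines are hypothetical names chosen for illustration, and String.join assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class ParagraphLines {
    /**
     * Returns the text of every non-empty <p> element, one paragraph per line.
     * A null input (for example a NULL CLOB) yields an empty string.
     */
    static String extractLines(String html) {
        if (html == null) {
            return "";
        }
        List<String> lines = new ArrayList<>();
        for (Element p : Jsoup.parse(html).select("p")) {
            String text = p.text(); // whitespace is normalized and trimmed
            if (!text.isEmpty()) {  // drop paragraphs that only hold spacing
                lines.add(text);
            }
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        // Hypothetical input with an empty paragraph in the middle.
        String html = "<p>Start: 8:30 am</p><p> </p><p>End: 4 pm</p>";
        System.out.println(extractLines(html));
    }
}
```

The value returned by ParagraphLines.extractLines(AUDIT_SCOPE_LOB) could then be assigned to your output port. Selecting only p elements is an assumption based on the sample data; if the real data puts text in other tags, the selector would need to be adjusted.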

Take a look at CSS Selectors for more examples of how to pick out your data. For instance, if you wanted only the labels of the numbered items, you could select on the class name and retrieve the second span in each paragraph.

for (Element span : doc.select("p.MsoNormal > span:nth-child(2)"))
    System.out.println(span.ownText());

Output

SOP Program
Training Program
Calibration/Preventive Maintenance Program
