Trying to extract content from a URL in Java

I am trying to extract the content of a webpage from a URL. I have already written the code, but I think I made a mistake in the regex part: when I run it, only the first line appears in the console. I am using NetBeans. The code I have so far:

private static String text;

public static void main(String[] args) {
    URL u;
    InputStream is = null;
    DataInputStream dis;
    String s;

    try {
        u = new URL("http://ghr.nlm.nih.gov/gene/AKT1 ");
        is = u.openStream();
        dis = new DataInputStream(new BufferedInputStream(is));

        text = "";
        while ((s = dis.readLine()) != null) {
            text+=s;
        }
    } catch (MalformedURLException mue) {
        System.out.println("Ouch - a MalformedURLException happened.");
        mue.printStackTrace();
        System.exit(1);
    } catch (IOException ioe) {
        System.out.println("Oops- an IOException happened.");
        ioe.printStackTrace();
        System.exit(1);
    } finally {
        String pattern = "(?i)(<P>)(.+?)";
        System.out.println(text.split(pattern)[1]);

        try {
            is.close();
        } catch (IOException ioe) {
        }
    }
}
}

Consider extracting the webpage content with a dedicated HTML parsing API such as jsoup. A simple example that uses your URL and prints the text of every <p> element would be:

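// Needs java.io.IOException, org.jsoup.Jsoup, org.jsoup.nodes.Document,
// org.jsoup.nodes.Element and org.jsoup.select.Elements to be imported.
public static void main(String[] args) {
    try {
        // Fetch the page and parse it into a DOM
        Document doc = Jsoup.connect("http://ghr.nlm.nih.gov/gene/AKT1").get();
        // Select every <p> element in the document
        Elements els = doc.select("p");

        for (Element el : els) {
            // text() returns the element's text with the markup stripped
            System.out.println(el.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}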

Console:

On this page:
The official name of this gene is “v-akt murine thymoma viral oncogene homolog 1.”
AKT1 is the gene's official symbol. The AKT1 gene is also known by other names, listed below.
Read more about gene names and symbols on the About page.
The AKT1 gene provides instructions for making a protein called AKT1 kinase. This protein is found in various cell types throughout the body, where it plays a critical role in many signaling pathways. For example, AKT1 kinase helps regulate cell growth and division (proliferation), the process by which cells mature to carry out specific functions (differentiation), and cell survival. AKT1 kinase also helps control apoptosis, which is the self-destruction of cells when they become damaged or are no longer needed.
...
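
If adding a library is not an option, a regex applied with Matcher.find() can handle simple pages, although it is far more fragile than a real parser. The following is only a minimal sketch: it assumes every paragraph has an explicit closing </p> tag and that paragraphs are not nested, and the class name ParagraphRegex, the method paragraphs and the sample HTML are made up for illustration. Any inner tags such as <a> or <b> are left in the result as-is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question or answers.
class ParagraphRegex {

    // CASE_INSENSITIVE matches both <p> and <P>.
    // DOTALL lets '.' match line breaks, so paragraphs may span lines.
    // <p\b[^>]*> matches an opening tag with optional attributes (but not <pre>).
    // (.*?) is reluctant, so each match stops at the nearest </p>.
    private static final Pattern P_TAG = Pattern.compile(
            "<p\\b[^>]*>(.*?)</p>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the raw inner content of every <p>...</p> pair, trimmed.
    static List<String> paragraphs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = P_TAG.matcher(html);
        while (m.find()) {
            result.add(m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample input; a real page would come from the URL.
        String html = "<html><body><P>first</P>\n"
                + "<p class=\"x\">second\nline</p></body></html>";
        for (String p : paragraphs(html)) {
            System.out.println(p);
        }
    }
}
```

Iterating with find() returns every paragraph, whereas split(pattern)[1] in the question only ever returns one piece of the text.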

You are missing a newline character during string concatenation.
Append a newline character to the text after each line is read.

Change:

while ((s = dis.readLine()) != null) {
    text+=s;
}

To:

while ((s = dis.readLine()) != null) {
    text += s + "\n";
}

I suggest using StringBuilder rather than String to build the final text.

StringBuilder text = new StringBuilder( 1024 );
...
while ((s = dis.readLine()) != null) {
    text.append( s ).append( "\n" );
}

...
System.out.println( text.toString() );
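
Put together, the read loop could look like the sketch below. It swaps in BufferedReader, because DataInputStream.readLine() is deprecated, and uses try-with-resources so the stream is closed without a finally block. The class name PageReader, the methods readAll and readUrl, and the choice of UTF-8 as the page encoding are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original question or answers.
class PageReader {

    // Reads every line from the stream and appends '\n' after each one.
    static String readAll(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder(1024);
        // try-with-resources closes the reader and the wrapped stream.
        // UTF-8 is an assumption; the real page may declare another charset.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    // Opens the given address and reads the whole response body.
    static String readUrl(String address) throws IOException {
        return readAll(new URL(address).openStream());
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java PageReader <url>");
            return;
        }
        System.out.println(readUrl(args[0]));
    }
}
```

Running it with http://ghr.nlm.nih.gov/gene/AKT1 as the argument would print the page source with its line breaks preserved.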


 