Getting Data from multiple a tags in HTML

I am scraping a medical website and need to extract information about a drug section by section, e.g. Precautions, Contraindications, Dosage, Uses. The HTML looks like the sample below. If I just select p.drug-content, I get the content under all the headers as one big paragraph. How do I get the content per header, so that the dosage paragraph comes under Dosage, the precautions paragraph under Precautions, and so on?

<a name="Warning"></a>
<div class="report-content drug-widget">
    <div class="drug-header"><h2 style="color:#000000!important;">What are the warnings and precautions for Abacavir? </h2></div>
    <p class="drug-content">
                        • Caution is advised when used in patients with history of depression or at risk for heart disease<br>•  Avoid use with alcohol.<br>•  Take along with other anti-HIV drugs and not alone, to prevent resistance.<br>•  Continue other precautions to prevent spread of HIV infection.</p></div>
<a name="Prescription"></a>
<div class="report-content drug-widget">
    <div class="drug-header"><h2 style="color:#000000!important;">Why is Abacavir Prescribed? (Indications) </h2></div>
    <p class="drug-content">Abacavir is an antiviral drug that is effective against the HIV-1 virus. It acts on an enzyme of the virus called reverse transcriptase, which plays an important role in its multiplication.  Though abacavir reduces viral load and may slow the progression of the disease, it does not cure the HIV infection.&nbsp;</p></div>
<a name="Dosage"></a>
<div class="report-content drug-widget">
    <div class="drug-header"><h2 style="color:#000000!important;">What is the dosage of Abacavir?</h2></div>
    <p class="drug-content"> Treatment of HIV-1/AIDS along with other medications. Dose in adults is 600 mg daily, as a single dose or divided into two doses.
</p></div>

Here is my code:

private static void ScrapingDrugInfo() throws IOException{
            Connection.Response response = null;
            Document doc = null;
            List<SideEffectsObject> sideEffectsList = new ArrayList<>();
            int i=0;

            String[] keywords = {"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"};

            for (String keyword : keywords){
                final String url = "https://www.medindia.net/doctors/drug_information/home.asp?alpha=" + keyword;

                response = Jsoup.connect(url)
                        .userAgent("Mozilla/5.0")
                        .execute();

                doc = response.parse();

                Element tds = doc.select("div.related-links.top-gray.col-list.clear-fix").first();

                Elements links = tds.select("li[class=list-item]");


                for (Element link : links){

                    final String newURL = "https://www.medindia.net/doctors/drug_information/".concat(link.select("a").attr("href")) ;

                    response = Jsoup.connect(newURL)
                            .userAgent("Mozilla/5.0")
                            .execute();

                    doc = response.parse();

                    Elements classification = doc.select("div.clear.b");
                    System.out.println("Classification::"+classification.text());

                    Elements drugBrands = doc.select("div.drug-content");
                    Elements drugBrandsIndian = drugBrands.select("div.links");

                    System.out.println("Drug Brand Links Indian::"+ drugBrandsIndian.select("a[href]"));

                    System.out.println("Drug Brand Names Indian::"+ drugBrandsIndian.text());

                    System.out.println("Drug Brand Names International::"+doc.select("div.drug-content.h3").text());

                    Elements prescritpionText = doc.select("a[name=Prescription]");
                    Elements prescriptionData = prescritpionText.select("p.drug-content");

                    System.out.println("Prescription Data::"+ prescriptionData.text());


                    Elements contraindications = doc.select("a[name=Contraindications]");

                    Elements contraindicationsText = contraindications.select("p[class=drug-content]");

                    System.out.println("Contrainidications Text::" + contraindicationsText.text());


                    Elements dosage = doc.select("a[name=Dosage]");

                    Elements dosageText = dosage.select("p[class=drug-content]");

                    System.out.println("Dosage Text::" + dosageText.text());
                }
            }
}

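If I understand the question correctly, you want to pair the name attribute of each a tag with the p content of the div that follows it. Your current selects come back empty because select() only searches inside the matched element, and each a tag is empty: the div is its next sibling, not a child. You should be able to get the pairs with the following code: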

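Elements aTags = doc.select("a[name]");

for (Element header : aTags) {
    // The content div is the next sibling of the a tag, not a child of it
    Element sibling = header.nextElementSibling();
    if (sibling == null) {
        continue; // nothing follows this a tag
    }
    // Get the sibling div's p content
    Element pTag = sibling.select("p.drug-content").first();
    if (pTag == null) {
        continue; // the sibling is not a drug-content section
    }
    System.out.println(header.attr("name"));
    System.out.println(pTag.text());
}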
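
If you need to look sections up by name afterwards (Dosage, Prescription, and so on) rather than print them in page order, one option is to collect the pairs into a map. The following is only a minimal sketch under some assumptions: the class name DrugSections and the helper extractSections are made up for illustration, the HTML in main is a shortened copy of the sample from the question, and it assumes every section follows the same "a tag, then div" layout.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DrugSections {

    // Hypothetical helper: maps each a tag's name attribute to the text of
    // the p.drug-content found in the element that directly follows it.
    public static Map<String, String> extractSections(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> sections = new LinkedHashMap<>(); // keeps page order

        for (Element anchor : doc.select("a[name]")) {
            Element sibling = anchor.nextElementSibling();
            if (sibling == null) {
                continue; // nothing follows this a tag
            }
            Element content = sibling.select("p.drug-content").first();
            if (content != null) {
                sections.put(anchor.attr("name"), content.text().trim());
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        // Shortened version of the sample HTML from the question (assumed layout)
        String html =
              "<a name=\"Prescription\"></a>"
            + "<div class=\"report-content drug-widget\">"
            + "<div class=\"drug-header\"><h2>Why is Abacavir Prescribed? (Indications)</h2></div>"
            + "<p class=\"drug-content\">Abacavir is an antiviral drug.</p></div>"
            + "<a name=\"Dosage\"></a>"
            + "<div class=\"report-content drug-widget\">"
            + "<div class=\"drug-header\"><h2>What is the dosage of Abacavir?</h2></div>"
            + "<p class=\"drug-content\"> Dose in adults is 600 mg daily.</p></div>";

        Map<String, String> sections = extractSections(html);
        System.out.println("Prescription Data::" + sections.get("Prescription"));
        System.out.println("Dosage Text::" + sections.get("Dosage"));
    }
}
```

In your scraping loop you could pass response.body() to such a helper, or change it to accept the Document you already parsed. A section that is missing on a given drug page simply has no key in the map, so sections.get(...) returns null for it.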
