Get data from multiple a tags in HTML

Question

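I am scraping a medical website and need to extract a drug's information section by section: precautions, contraindications, dosage, uses, and so on. The HTML looks like the sample below. If I select only p.drug-content, the content under all the headers comes back as one big paragraph. How can I get the content grouped by header, so that the dosage text ends up under dosage, the precaution text under precautions, and so on?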

<a name="Warning"></a>
<div class="report-content drug-widget">
    <div class="drug-header"><h2 style="color:#000000!important;">What are the warnings and precautions for Abacavir? </h2></div>
    <p class="drug-content">
                        • Caution is advised when used in patients with history of depression or at risk for heart disease<br>•  Avoid use with alcohol.<br>•  Take along with other anti-HIV drugs and not alone, to prevent resistance.<br>•  Continue other precautions to prevent spread of HIV infection.</p></div>
<a name="Prescription"></a>
<div class="report-content drug-widget">
    <div class="drug-header"><h2 style="color:#000000!important;">Why is Abacavir Prescribed? (Indications) </h2></div>
    <p class="drug-content">Abacavir is an antiviral drug that is effective against the HIV-1 virus. It acts on an enzyme of the virus called reverse transcriptase, which plays an important role in its multiplication.  Though abacavir reduces viral load and may slow the progression of the disease, it does not cure the HIV infection.&nbsp;</p></div>
<a name="Dosage"></a>
<div class="report-content drug-widget">
    <div class="drug-header"><h2 style="color:#000000!important;">What is the dosage of Abacavir?</h2></div>
    <p class="drug-content"> Treatment of HIV-1/AIDS along with other medications. Dose in adults is 600 mg daily, as a single dose or divided into two doses.
</p></div>

Here is my code:

private static void ScrapingDrugInfo() throws IOException{
            Connection.Response response = null;
            Document doc = null;
            List<SideEffectsObject> sideEffectsList = new ArrayList<>();
            int i=0;

            String[] keywords = {"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"};

            for (String keyword : keywords){
                final String url = "https://www.medindia.net/doctors/drug_information/home.asp?alpha=" + keyword;

                response = Jsoup.connect(url)
                        .userAgent("Mozilla/5.0")
                        .execute();

                doc = response.parse();

                Element tds = doc.select("div.related-links.top-gray.col-list.clear-fix").first();

                Elements links = tds.select("li[class=list-item]");


                for (Element link : links){

                    final String newURL = "https://www.medindia.net/doctors/drug_information/".concat(link.select("a").attr("href")) ;

                    response = Jsoup.connect(newURL)
                            .userAgent("Mozilla/5.0")
                            .execute();

                    doc = response.parse();

                    Elements classification = doc.select("div.clear.b");
                    System.out.println("Classification::"+classification.text());

                    Elements drugBrands = doc.select("div.drug-content");
                    Elements drugBrandsIndian = drugBrands.select("div.links");

                    System.out.println("Drug Brand Links Indian::"+ drugBrandsIndian.select("a[href]"));

                    System.out.println("Drug Brand Names Indian::"+ drugBrandsIndian.text());

                    System.out.println("Drug Brand Names International::"+doc.select("div.drug-content.h3").text());

                    Elements prescritpionText = doc.select("a[name=Prescription]");
                    Elements prescriptionData = prescritpionText.select("p.drug-content");

                    System.out.println("Prescription Data::"+ prescriptionData.text());


                    Elements contraindications = doc.select("a[name=Contraindications]");

                    Elements contraindicationsText = contraindications.select("p[class=drug-content]");

                    System.out.println("Contrainidications Text::" + contraindicationsText.text());


                    Elements dosage = doc.select("a[name=Dosage]");

                    Elements dosageText = dosage.select("p[class=drug-content]");

                    System.out.println("Dosage Text::" + dosageText.text());
                }
            }
}

Answer 1

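If I understand the question correctly, it sounds like you want to pair the value of each a tag's name attribute with the p content of the div that follows it. In your code, doc.select("a[name=Prescription]").select("p.drug-content") searches inside the anchor itself, but the anchor is empty; the paragraph lives in the div that comes right after it. You should be able to pair them with the following code: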

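Elements aTags = doc.select("a[name]");

for (Element header : aTags) {
    // The content div is the next sibling of the anchor; skip anchors that have none
    Element widget = header.nextElementSibling();
    if (widget == null) {
        continue;
    }

    // Take the first p.drug-content inside that div, if there is one
    Element pTag = widget.select("p.drug-content").first();
    if (pTag == null) {
        continue;
    }

    System.out.println(header.attr("name"));
    System.out.println(pTag.text());
}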
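
If you would rather keep each section's text under its own name than print it straight away, the same sibling lookup can fill a map. This is only a minimal sketch, assuming jsoup is on the classpath. The class name DrugSections, the method extractSections, and the trimmed SAMPLE_HTML string are made up for illustration and are not part of the original code or of the site.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DrugSections {

    // Hypothetical, shortened copy of the markup from the question.
    static final String SAMPLE_HTML =
          "<a name=\"Prescription\"></a>"
        + "<div class=\"report-content drug-widget\">"
        + "<div class=\"drug-header\"><h2>Why is Abacavir Prescribed? (Indications) </h2></div>"
        + "<p class=\"drug-content\">Abacavir is an antiviral drug that is effective against the HIV-1 virus.</p></div>"
        + "<a name=\"Dosage\"></a>"
        + "<div class=\"report-content drug-widget\">"
        + "<div class=\"drug-header\"><h2>What is the dosage of Abacavir?</h2></div>"
        + "<p class=\"drug-content\"> Dose in adults is 600 mg daily.</p></div>";

    // Maps each anchor name (e.g. "Dosage") to the text of the p.drug-content
    // found in the element that directly follows the anchor.
    static Map<String, String> extractSections(Document doc) {
        // LinkedHashMap keeps the sections in page order.
        Map<String, String> sections = new LinkedHashMap<>();
        for (Element anchor : doc.select("a[name]")) {
            Element widget = anchor.nextElementSibling();
            if (widget == null) {
                continue; // anchor is the last element in its parent
            }
            Element content = widget.select("p.drug-content").first();
            if (content != null) {
                sections.put(anchor.attr("name"), content.text());
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        extractSections(doc).forEach((name, text) -> System.out.println(name + ": " + text));
    }
}
```

In the scraper from the question you would call extractSections(doc) once per drug page and then read sections.get("Dosage"), sections.get("Prescription") and so on, instead of running a separate select for every section. A lookup for a name that is not on the page simply returns null.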
