
How can I skip the wrapper tag and use only the text information with jsoup?

I am writing a scraper that collects product prices. On this site I need to skip a wrapper such as the div class shown below, but the markup differs from one website to another, so I cannot hard-code it. The first element my scraper reports looks like this:

1 - <div class="ProductPrice">
        <span id="ctl00_ContentPlaceHolder1_Category1_ctrl_0_ctrl_7_mainGrid_ctl00_PUnit_lblPriceWithTax">47,00 TL</span>
    </div>

Then it reports a second element, which is the same price again, this time from the inner tag (tag names vary between sites, so please take that into account when answering):

 2 - <span id="ctl00_ContentPlaceHolder1_Category1_ctrl_0_ctrl_7_mainGrid_ctl00_PUnit_lblPriceWithTax">47,00 TL</span>

My code is:

Elements allElements = newDocument.getAllElements();
for (int j = 0; j < allElements.size(); j++) {
    Element element = allElements.get(j);
    // text() includes the text of all descendants, so a wrapper matches too.
    if (element.text().matches(regex)) {
        // Writing to console.
    }
}
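
To illustrate why both elements are reported, here is a minimal sketch using only the JDK. It assumes a hypothetical price pattern (PRICE_REGEX, since the real regex is not shown above) and relies on the fact that jsoup's text() returns the normalized, combined text of an element and its descendants, so the wrapping div and the inner span yield the same string.

```java
class PriceRegexDemo {
    // Hypothetical pattern for prices such as "47,00 TL"; the real regex is not shown.
    static final String PRICE_REGEX = "\\d+(,\\d{2})?\\s*TL";

    static boolean isPrice(String text) {
        // String.matches succeeds only when the entire string fits the pattern.
        return text.matches(PRICE_REGEX);
    }

    public static void main(String[] args) {
        String spanText = "47,00 TL"; // text of the inner <span>
        String divText = "47,00 TL";  // combined text of the wrapping <div>: the same string
        System.out.println(isPrice(spanText));
        System.out.println(isPrice(divText));
        System.out.println(isPrice("Price: 47,00 TL"));
    }
}
```

Because the two strings are identical, a test on text() alone cannot tell the wrapper from the element that actually holds the price.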

I would try (untested code):

Elements elements = newDocument.select("div[class*=ProductPrice]");
for (Element element : elements) {
    // html() returns the inner HTML as a String, not an Element.
    String inner = element.html();
    // do whatever you want with "inner", which contains your span
}

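Edit: After your comment, I think you should use Elements elements = newDocument.select("*:matches(regex)");, where "regex" is the regular expression you need to extract a price. Note that :matches tests an element's combined text, including its descendants; jsoup also offers :matchesOwn, which tests only the element's own text. This should give you the list of elements you need, without using element.html():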

Elements elements = newDocument.select("*:matches(" + regex + ")");
for (Element element : elements) {
    // do whatever you want with "element", whose text matches the regex
}
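
// Assumes "element" is an element whose text matched and "regex" is the price pattern.
boolean loopBool = true;
int k = 0;
while (loopBool)
{
    if (k < element.children().size())
    {
        if (!element.child(k).text().matches(regex))
        {
            // This child does not match; try the next one.
            k++;
        }
        else
        {
            // A child matches the pattern: remove this element's child nodes and stop.
            element.empty();
            loopBool = false;
        }
    }
    else if (element.children().isEmpty())
    {
        // Leaf element: nothing left to inspect, so stop instead of looping forever.
        loopBool = false;
    }
    else
    {
        // No direct child matched; descend into the first child and start over.
        k = 0;
        element = element.child(0);
    }
}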

I solved this problem by checking whether the element has children. If it does, I check whether any of them match the regex; if none match, I keep walking down through the children until I find an acceptable element.
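
The same idea (keep a matching element only when none of its children also match) can be sketched without any loop state. The example below is an illustration under stated assumptions, not the code from this answer: it uses the JDK's built-in XML DOM parser instead of jsoup so that it runs without extra dependencies, and PRICE_REGEX and the shortened id are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class InnermostMatch {
    // Hypothetical price pattern; replace it with the regex used by the scraper.
    static final String PRICE_REGEX = "\\d+(,\\d{2})?\\s*TL";

    // True when the element's whole trimmed text (descendants included) is a price.
    static boolean matches(Element e) {
        return e.getTextContent().trim().matches(PRICE_REGEX);
    }

    // True when at least one direct child element also matches.
    static boolean hasMatchingChild(Element e) {
        NodeList kids = e.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (kid instanceof Element && matches((Element) kid)) {
                return true;
            }
        }
        return false;
    }

    // Returns the tag names of matching elements that have no matching child.
    static List<String> innermostMatches(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalArgumentException("Input is not well-formed XML", ex);
        }
        NodeList all = doc.getElementsByTagName("*");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (matches(e) && !hasMatchingChild(e)) {
                result.add(e.getTagName());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<div class=\"ProductPrice\"><span id=\"price\">47,00 TL</span></div>";
        System.out.println(innermostMatches(html));
    }
}
```

With jsoup the two helper checks would presumably be written against element.text() and element.children() instead; the filtering logic stays the same.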
