How can I use only the text and ignore the wrapping tag when using JSOUP?

Question

I am writing a scraper that extracts product prices, and I need to ignore wrapper elements such as the div below. The class name differs from site to site, so I cannot hard-code it, which is a real problem for me. As you can see here, the first element I scrape comes out like this:

1 - <div class="ProductPrice"> 
     <span id="ctl00_ContentPlaceHolder1_Category1_ctrl_0_ctrl_7_mainGrid_ctl00_PUnit_lblPriceWithTax">47,00 TL</span> 
    </div>

Then I scrape the second one, which is the inner tag again (tag names can vary, so please take that into account before answering):

 2 - <span id="ctl00_ContentPlaceHolder1_Category1_ctrl_0_ctrl_7_mainGrid_ctl00_PUnit_lblPriceWithTax">47,00 TL</span>

My code is:

Elements allElements = newDocument.getAllElements();
for (int j = 0; j < allElements.size(); j++) {
    Element element = allElements.get(j);
    if (element.text().matches(regex)) {
        // Writing to console.
    }
}

Answer 1

I would try (untested code):

Elements elements = newDocument.select("div[class*=ProductPrice]");
for (Element element : elements) {
    // html() returns the inner HTML as a String, not an Element
    String inner = element.html();
    // do whatever you want with "inner", which contains your span
}

Edit: After your comment, I think you should use Elements elements = newDocument.select("*:matches(regex)"); where "regex" is the regular expression you need to extract a price. This should give you the list of elements you need, without using element.html():

Elements elements = newDocument.select("*:matches(" + regex + ")");
for (Element element : elements) {
    // do whatever you want with "element", whose text matches the regex
}
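The regular expression itself is never shown in the thread. As a minimal sketch of what it might look like for prices written like 47,00 TL, the following uses only java.util.regex. The class name PriceRegexSketch, the pattern, and the parsePrice helper are all assumptions made for illustration, not the asker's actual code.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PriceRegexSketch {
    // Hypothetical pattern: digits, a decimal comma, two digits, then the literal "TL".
    static final Pattern PRICE = Pattern.compile("(\\d+),(\\d{2})\\s*TL");

    /** Returns the price in hundredths if the whole (trimmed) text is a price, else empty. */
    static Optional<Long> parsePrice(String text) {
        Matcher m = PRICE.matcher(text.trim());
        // matches() requires the entire input to fit the pattern, like String.matches()
        if (!m.matches()) {
            return Optional.empty();
        }
        long whole = Long.parseLong(m.group(1));
        long fraction = Long.parseLong(m.group(2));
        return Optional.of(whole * 100 + fraction);
    }

    public static void main(String[] args) {
        System.out.println(parsePrice("47,00 TL"));        // prints Optional[4700]
        System.out.println(parsePrice("Price: 47,00 TL")); // prints Optional.empty
    }
}
```

Because matches() is a whole-string test, text that merely contains a price is rejected here. That is the same behaviour as the element.text().matches(regex) check in the question.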

Answer 2

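// Assumes element, regex, k (starting at 0) and loopBool (starting at true) are declared earlier.
while (loopBool)
{
    if (k < element.children().size())
    {
        if (!element.child(k).text().matches(regex))
        {
            k++;                        // this child does not match, try the next one
        }
        else
        {
            element.empty();            // a child matches, so clear the wrapper element
            loopBool = false;
        }
    }
    else if (element.children().isEmpty())
    {
        loopBool = false;               // leaf element: nothing further to inspect
    }
    else
    {
        k = 0;
        element = element.child(k);     // no direct child matched, descend into the first child
    }
}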

I solved this problem by checking whether the element has children. If it does, I check whether any child matches the regex; if none matches, I walk down through the children to find an acceptable element.
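The idea described above can also be sketched as a small recursive helper that leaves the document unmodified. This is only an illustration under stated assumptions: jsoup is on the classpath, the class name DeepestMatchSketch and the method deepestMatch are hypothetical, and the price regex is invented for the example. Unlike the loop above, it follows the first child whose text matches rather than always the first child.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DeepestMatchSketch {
    /**
     * Starting from an element whose text matches the regex, follows matching
     * children downwards and returns the innermost element that still matches.
     */
    static Element deepestMatch(Element element, String regex) {
        for (Element child : element.children()) {
            if (child.text().matches(regex)) {
                // A child carries the same text, so the current element is only a wrapper.
                return deepestMatch(child, regex);
            }
        }
        // No child matches: this element holds the text itself.
        return element;
    }

    public static void main(String[] args) {
        String html = "<div class=\"ProductPrice\">"
                + "<span id=\"price\">47,00 TL</span>"
                + "</div>";
        Document doc = Jsoup.parse(html);
        String regex = "\\d+,\\d{2} TL"; // hypothetical price pattern

        Element wrapper = doc.select("div").first();
        Element inner = deepestMatch(wrapper, regex);
        System.out.println(inner.tagName()); // prints span
    }
}
```

Since the helper depends only on children() and text(), it makes no assumption about tag or class names, which was the constraint stated in the question.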

Answer 1
0 2012-06-21 07:56:30

Answer 2
0 Accepted 2012-06-22 06:45:49
