How can I skip wrapper tags and get only the text information with jsoup?
I am building a scraper that collects product prices. I need to skip wrapper elements such as the div class on this site, but the markup differs from one website to the next, so I cannot hard-code it. The first element I scrape comes out like this:
1 - <div class="ProductPrice">
<span id="ctl00_ContentPlaceHolder1_Category1_ctrl_0_ctrl_7_mainGrid_ctl00_PUnit_lblPriceWithTax">47,00 TL</span>
</div>
Then the second match is the inner tag, so the same price is scraped again (tag names vary between sites, so please take that into account before answering):
2 - <span id="ctl00_ContentPlaceHolder1_Category1_ctrl_0_ctrl_7_mainGrid_ctl00_PUnit_lblPriceWithTax">47,00 TL</span>
My code is:
Elements allElements = newDocument.getAllElements();
for (int j = 0; j < allElements.size(); j++) {
    Element element = allElements.get(j);
    // text() also includes the text of all descendants
    if (element.text().matches(regex)) {
        // Writing to console.
    }
}
I would try (untested code):
Elements elements = newDocument.select("div[class*=ProductPrice]");
for (Element element : elements) {
    String inner = element.html(); // html() returns the inner HTML as a String
    // do whatever you want with "inner", which contains your span
}
Edit: After your comment, I think you should use
Elements elements = newDocument.select("*:matches(regex)");
where "regex" is the regular expression you need to extract a price. This should give you the list of elements you need, without using element.html():
Elements elements = newDocument.select("*:matches(" + regex + ")");
for (Element element : elements) {
    // do whatever you want with "element", whose text matches the regex
}
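The regex itself is never shown in this thread, so here is a minimal sketch of what a price pattern could look like, assuming prices formatted like "47,00 TL" (optional dots as thousands separators, a decimal comma, an uppercase currency code). The class name, the pattern, and the helper methods are all hypothetical. It also shows that String.matches(regex), as used in the question, requires the whole text to match, whereas jsoup's :matches selector, as far as I know, looks for the pattern anywhere in the text.

```java
import java.util.regex.Pattern;

final class PriceRegexSketch {
    // Hypothetical pattern (assumption): digits, optional ".ddd" groups,
    // a decimal comma with two digits, then a 2-3 letter currency code.
    static final Pattern PRICE =
            Pattern.compile("\\d+(?:\\.\\d{3})*,\\d{2}\\s*[A-Z]{2,3}");

    // Whole-string match, equivalent to text.trim().matches(regex).
    static boolean isExactlyPrice(String text) {
        return PRICE.matcher(text.trim()).matches();
    }

    // Partial match: true if a price occurs anywhere in the text.
    static boolean containsPrice(String text) {
        return PRICE.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(isExactlyPrice("47,00 TL"));
        System.out.println(isExactlyPrice("Price: 47,00 TL"));
        System.out.println(containsPrice("Price: 47,00 TL"));
    }
}
```

With a whole-string check, an element whose text is "Price: 47,00 TL" is skipped, while a partial check accepts it; which one you want depends on how the target sites label their prices.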
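I solved this problem by checking whether the element has children. If it does, I check whether any child matches the regex; if none of them match, I keep walking down through the children to find an acceptable element.
boolean loopBool = true;
int k = 0;
while (loopBool) {
    if (k < element.children().size()) {
        if (!element.child(k).text().matches(regex)) {
            k++; // this child does not match, try the next one
        } else {
            // a child matches: empty() removes all child nodes of this element
            element.empty();
            loopBool = false;
        }
    } else if (!element.children().isEmpty()) {
        // no direct child matches: continue the search in the first child
        k = 0;
        element = element.child(0);
    } else {
        loopBool = false; // leaf element, nothing left to check
    }
}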
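A different technique from the child walk above, sketched under assumptions: jsoup's Element.ownText() returns only the text that sits directly inside an element, so testing it instead of text() should skip wrappers such as the ProductPrice div without knowing any tag or class names. The class name, the price pattern, and the sample HTML (shortened from the question) are hypothetical, and the sketch needs jsoup on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class OwnTextSketch {
    // Hypothetical price pattern (assumption): "47,00 TL"-style prices only.
    static final Pattern PRICE = Pattern.compile("\\d+,\\d{2}\\s*TL");

    // Collect elements whose own text, ignoring descendants, contains a price.
    static List<Element> priceElements(Document doc) {
        List<Element> result = new ArrayList<>();
        for (Element element : doc.getAllElements()) {
            if (PRICE.matcher(element.ownText()).find()) {
                result.add(element);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<div class=\"ProductPrice\">"
                + "<span id=\"price\">47,00 TL</span></div>";
        for (Element element : priceElements(Jsoup.parse(html))) {
            System.out.println(element.tagName() + " -> " + element.ownText());
        }
    }
}
```

For the sample markup only the span is reported, because the div has no text of its own. jsoup also offers a :matchesOwn(regex) selector that applies the same own-text test inside a select() call, which may be more convenient if you prefer the selector style used in the other answer.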