Javascript website scraping using WebKit and Selenium tools

I tried scraping a JavaScript website using two tools, and neither worked. The website link is: http://xx.xxx.com/category-499399872.htm The relevant text I'm trying to extract is GY-68...:

<div class="item3line1">

    <dl class="item " data-id="38952795780">
        <dt class="photo">
            <a target="_blank" href="//item.xxx.com/item.htm?spm=a1z10.5-c.w4002-6778075404.11.54MDOI&id=38952795780" data-spm-wangpu-module-id="4002-6778075404" data-spm-anchor-id="a1z10.5-c.w4002-6778075404.11">
                <img src="//img.xxx.com/bao/uploaded/i4/TB1HMt3FFXXXXaFaVXXXXXXXXXX_!!0-item_pic.jpg_240x240.jpg" alt="GY-68 BMP180 新款 BOSCH温度 气压传感器模块 代替BMP085"></img>
            </a>
        </dt>

Maybe this is a silly suggestion, but you are trying to find the element by the class name "col-main", while the sample code has the class name "item-name".

There is a space in the class name: it is 'item ' (with a trailing space), not 'item'. Because of that, you have to rewrite the XPath as

  //dl[@class="item "]/dt[@class="photo"]/a/img

There is a way around that. You can use the normalize-space() function, which strips leading and trailing whitespace from a string (and collapses internal runs of whitespace into a single space).

  //dl[normalize-space(@class)="item"]/dt[@class="photo"]/a/img
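To see why the trailing space matters, here is a minimal sketch using only Python's standard library. The markup is a simplified, well-formed stand-in for the snippet in the question (the `href`, `src`, and `alt` values are shortened and illustrative only). `xml.etree.ElementTree` supports only a subset of XPath and has no `normalize-space()`, so the whitespace-tolerant match is emulated in Python with a hypothetical helper, `has_class`, rather than with the XPath function itself.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the markup in the question
# (attribute values shortened; they are illustrative only).
SNIPPET = """
<div class="item3line1">
  <dl class="item " data-id="38952795780">
    <dt class="photo">
      <a href="//item.xxx.com/item.htm">
        <img src="pic.jpg" alt="GY-68 BMP180" />
      </a>
    </dt>
  </dl>
</div>
"""

root = ET.fromstring(SNIPPET)

# Exact attribute comparison: 'item' is not equal to 'item ' (trailing space).
exact_without_space = root.findall(".//dl[@class='item']/dt[@class='photo']/a/img")
exact_with_space = root.findall(".//dl[@class='item ']/dt[@class='photo']/a/img")


def has_class(element, name):
    """Hypothetical helper: True if `name` is one of the whitespace-separated classes."""
    return name in element.get("class", "").split()


# Whitespace-tolerant match done in Python, standing in for normalize-space().
tolerant = [
    img
    for dl in root.iter("dl")
    if has_class(dl, "item")
    for img in dl.findall("./dt[@class='photo']/a/img")
]

alts = [img.get("alt") for img in tolerant]
print(len(exact_without_space), len(exact_with_space), alts)
```

With a full XPath 1.0 engine (for example the one behind a browser driven by Selenium), the `normalize-space(@class)` expression above should achieve the same effect directly in the query.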

Or you can go with

  //a[@class='item-name']

This also refers to the element, and its text is equal to the img's alt attribute.

You should check these scraping websites. These are the best scraping tools, and I am using them.
