php - web scraping - click elements that trigger AJAX calls, then scrape the page (can be done in Python)

Question

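I have some Python code that scrapes a page, finds all elements with the class name "group-head", and clicks them to produce a page in which all the AJAX calls have been executed. This works in Python, but I would like to know whether it can also be done with curl / PHP?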

# Get scraping...
import time
from lxml import html
from selenium.webdriver.common.by import By

tree = parseLxml(driver=driver, url=url)  # Go to URL and parse (parseLxml is the asker's own helper, not shown)
elem = driver.find_elements(By.CLASS_NAME, 'group-head')  # Use ChromeDriver to find the elements that trigger the Ajax calls
for e in elem:  # Loop through all such elements
    try:
        time.sleep(0.5)
        e.click()  # Click the element
        time.sleep(1.5)  # Too fast and errors can occur, so wait...
    except Exception:
        pass  # Skip elements that cannot be clicked
newpage = driver.page_source  # Need to get page source again now all visible
newtree = html.fromstring(newpage)
match = newtree.xpath('//td[contains(@class,"score-time")]/a/@href')  # Scrape match links
base = 'http://uk.soccerway.com'
for m in match:
    mURL = base + str(m)
    print('Match URL:', mURL)

Answer 1

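Your code uses ChromeDriver, so you should look for PHP bindings for it.

Take a look at https://github.com/facebook/php-webdriver ; you should be able to use it in the same way. The code is untested, but it should look something like this: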

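use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\WebDriverBy;

$host = 'http://localhost:4444/wd/hub'; // Selenium host
$driver = RemoteWebDriver::create($host, DesiredCapabilities::chrome()); // Connect to Selenium and start Chrome
$driver->get($url); // Go to URL and load page
$elements = $driver->findElements(WebDriverBy::className('group-head'));
....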

Answer 2

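Yes, this is possible with PHP :)

But you have to follow these steps:

1) Download a DOM parser for PHP (the file_get_html function used below comes from the PHP Simple HTML DOM Parser library).

2) For the click on a link in the page, you can make the AJAX call that fetches the content yourself (file_get_html).

3) Finally, get the required data by its id, element, or class name.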

$html = file_get_html('http://www.google.com/');

// Find all images
foreach ($html->find('img') as $element)
    echo $element->src . '<br>';

// Find all links
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';
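
For comparison with the Python code in the question, the parse-and-extract step of this approach can be sketched using only the Python standard library. This is a minimal sketch under assumptions: the sample HTML fragment is made up and merely stands in for whatever the AJAX URL returns, and LinkCollector and extract_links are hypothetical names, not part of any library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags found inside <td> cells of a given class."""

    def __init__(self, cell_class):
        super().__init__()
        self.cell_class = cell_class
        self.in_cell = False  # True while inside a matching <td>
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            # Whole-token match on the class list (the XPath in the
            # question uses a looser substring match).
            classes = (attrs.get("class") or "").split()
            self.in_cell = self.cell_class in classes
        elif tag == "a" and self.in_cell and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False


def extract_links(html_text, base, cell_class="score-time"):
    """Return absolute URLs for links inside cells with class `cell_class`."""
    parser = LinkCollector(cell_class)
    parser.feed(html_text)
    return [urljoin(base, href) for href in parser.hrefs]


# Made-up fragment for illustration only; real markup may differ.
SAMPLE_HTML = (
    '<table><tr>'
    '<td class="score-time status"><a href="/matches/2015/12/26/example/">2 - 1</a></td>'
    '<td class="team"><a href="/teams/example/">Team</a></td>'
    '</tr></table>'
)
BASE = "http://uk.soccerway.com"

if __name__ == "__main__":
    for link in extract_links(SAMPLE_HTML, BASE):
        print("Match URL:", link)
```

Like file_get_html, this only sees the HTML it is given. It does not execute JavaScript, so content that appears after a click has to be fetched from its AJAX URL separately.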

Answer 1
0 2015-12-22 12:37:22

Answer 2
-1 2015-12-29 12:09:47