How can I scrape data from a website within a frame using R?

The following link contains the results of the Paris Marathon: http://www.schneiderelectricparismarathon.com/us/the-race/results/results-marathon . I want to scrape these results, but the information sits inside a frame. I know the basics of scraping with rvest and RSelenium, but I have no idea how to retrieve data from within such a frame. To give an idea, one of the things I tried was:

library(rvest)

url = "http://www.schneiderelectricparismarathon.com/us/the-race/results/results-marathon"
site = read_html(url)
ParisResults = site %>% html_node("iframe") %>% html_table()
ParisResults = as.data.frame(ParisResults)

Any help in solving this problem would be very welcome!

The results are loaded by AJAX from the following URL:

library(rvest)

url <- "http://www.aso.fr/massevents/resultats/ajax.php?v=1460995792&course=mar16&langue=us&version=3&action=search"
# html_table() on a node set returns a list of data frames, one per matching table
results <- url %>%
  read_html(encoding = "UTF-8") %>%
  html_nodes(xpath = '//table[@class="footable"]') %>%
  html_table()
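Because `html_table()` applied to a node set returns a list, one more step is usually needed to get at the data frame itself. Here is a minimal offline sketch of that step. It uses a made-up HTML fragment in place of the real response, so the column names and rows are invented for illustration and the real table may look different:

```r
library(rvest)

# Hypothetical stand-in for the AJAX response; the real columns may differ.
sample_html <- '<html><body>
<table class="footable">
  <tr><th>Rank</th><th>Name</th><th>Time</th></tr>
  <tr><td>1</td><td>Runner A</td><td>02:07:11</td></tr>
  <tr><td>2</td><td>Runner B</td><td>02:07:33</td></tr>
</table></body></html>'

tables <- read_html(sample_html) %>%
  html_nodes(xpath = '//table[@class="footable"]') %>%
  html_table()

# The list has one element per matching table; take the first (and only) one.
results_df <- tables[[1]]
print(results_df)
```

The same `[[1]]` extraction should apply to the list returned for the real URL, assuming the response contains a single table with that class.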

PS: I don't know exactly what AJAX is, and I only know the basics of rvest.

EDIT: to answer the question in the comment: I don't have much experience in web scraping. If you only use very basic techniques with rvest or XML, you need to understand the website a little better, and every site has its own structure. For this one, here is what I did:

  1. As you can see, the page source does not show any results because they are in an iframe. When inspecting the code, you can see the following after "RESULTS OF 2016 EDITION":

    class="iframe-xdm iframe-resultats" data-href="http://www.aso.fr/massevents/resultats/index.php?langue=us&course=mar16&version=3"

  2. You can now use this URL directly: http://www.aso.fr/massevents/resultats/index.php?langue=us&course=mar16&version=2

  3. But you still can't get the results. You can then use Chrome developer tools > Network > XHR. When refreshing the page, you can see that the data is loaded from this URL (when you choose the Men category): http://www.aso.fr/massevents/resultats/ajax.php?course=mar16&langue=us&version=2&action=search&fields%5Bsex%5D=F&limiter=&order=

  4. Now you can get the results!

  5. If you want the second page and so on, click the page number, then use the developer tools to see what happens!
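Steps 1 and 3 can be sketched in code roughly as follows. This is only an illustration under some assumptions: the HTML fragment is a hypothetical stand-in for the marathon page (only the `class` and `data-href` values come from the steps above, and the `div` element is a guess), and `build_ajax_url` is a helper name made up here, not part of rvest.

```r
library(rvest)

# Hypothetical fragment of the marathon page; the element type is assumed.
page_html <- '<html><body>
<div class="iframe-xdm iframe-resultats"
     data-href="http://www.aso.fr/massevents/resultats/index.php?langue=us&amp;course=mar16&amp;version=3"></div>
</body></html>'

# Step 1: read the frame URL from the data-href attribute.
frame_url <- read_html(page_html) %>%
  html_node(".iframe-resultats") %>%
  html_attr("data-href")

# Step 3: assemble the AJAX URL from named query parameters (illustrative helper).
build_ajax_url <- function(params,
                           base = "http://www.aso.fr/massevents/resultats/ajax.php") {
  keys <- vapply(names(params), URLencode, character(1), reserved = TRUE)
  vals <- vapply(params, function(v) URLencode(as.character(v), reserved = TRUE),
                 character(1))
  paste0(base, "?", paste(keys, vals, sep = "=", collapse = "&"))
}

# Parameters copied from the URL observed in the XHR tab.
ajax_url <- build_ajax_url(list(
  course = "mar16", langue = "us", version = "2", action = "search",
  "fields[sex]" = "F", limiter = "", order = ""
))
print(frame_url)
print(ajax_url)
```

Passing `ajax_url` to `read_html()` and then `html_table()` should give the results table, provided the site still serves that endpoint. For other pages or categories, the parameter names to change would have to be read from the developer tools as described in step 5.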
