
How do I scrape data off of multiple pages of info when the URL is static?

I'm learning how to scrape data from a webpage using R. The website I'm working with is:

http://sheriff.franklincountyohio.gov/search/real-estate/results.aspx?q=searchType%3dSaleDate%26searchString%3d12%2f26%2f2014+12%3a00%3a00+AM%26foreclosureType%3d%26sortType%3ddefendant%26saleDateFrom%3d%26saleDateTo%3d

The problem is that the listings aren't on one page; in this case they are spread across 7 different pages. The user navigates to the next page via arrow buttons at the bottom. However, the URL is static: whether I am on page 1 or page 5, it stays the same. So I don't know how to point R to the next page to retrieve the additional information.

Currently I use readLines to get the data off the page.

con <- url("http://sheriff.franklincountyohio.gov/search/real-estate/results.aspx?q=searchType%3dSaleDate%26searchString%3d12%2f26%2f2014%26foreclosureType%3d%26sortType%3ddefendant")
html <- readLines(con)
close(con)

I then use the XML package to start parsing out the data I want.

library(XML)
html.data <- htmlTreeParse(html, useInternalNodes = TRUE)

I've had trouble using the XML, RCurl and httr packages at work because of the firewall. The method above seems to be the only way I can scrape the data, so I might be limited in which functions I can use to follow a link.

Any help would be appreciated! I've searched a bunch and can't seem to find an answer.

The webpage has a "Print Sale List" button that opens a new page with all the information compiled on a single page (maybe the webpage didn't have that button at the time you posted the question).

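library(XML)
# The print view (printresults.aspx) returns every listing on one page.
url <- 'http://sheriff.franklincountyohio.gov/search/real-estate/printresults.aspx?q=searchType%3dSaleDate%26searchString%3d12%2f26%2f2014+12%3a00%3a00+AM%26foreclosureType%3d%26sortType%3ddefendant%26saleDateFrom%3d%26saleDateTo%3d'
# readHTMLTable returns a list of data frames, one per table on the page.
table <- readHTMLTable(url)
table1 <- as.data.frame(table)
str(table1)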
'data.frame':   92 obs. of  8 variables:
 $ c_printsearchresults_gvResults.Case.Number         : Factor w/ 92 levels "07CV4653\r\n                        PLURIESBANKRUPTCY",..: 23 47 33 90 91 82 85 77 68 83 ...
 $ c_printsearchresults_gvResults.Property.Address    : Factor w/ 92 levels "1038\r\n                        \r\n                        \r\n                        S OHIO AVENUE\r\n                      "| __truncated__,..: 7 80 85 26 79 37 83 55 51 33 ...
 $ c_printsearchresults_gvResults.Plaintiff...Attorney: Factor w/ 83 levels "Plaintiff:\r\n                        \r\n                        BAC HOME LOANS SERVICING LP FKA COUNTRYWIDE HOME LOANS SERVIC"| __truncated__,..: 5 31 80 74 49 14 73 52 39 41 ...
 $ c_printsearchresults_gvResults.Defendant           : Factor w/ 92 levels "ADEDEJI-FAJOBI/MODUPE/O",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ c_printsearchresults_gvResults.Appraised           : Factor w/ 59 levels "$10,268.33","$10,988.28",..: 48 20 18 10 25 6 41 58 35 15 ...
 $ c_printsearchresults_gvResults.Opening.Bid         : Factor w/ 63 levels "$10,268.33","$10,988.28",..: 38 5 4 52 11 51 29 45 23 63 ...
 $ c_printsearchresults_gvResults.Deposit             : Factor w/ 61 levels "$1,200.00","$10,268.33",..: 49 20 18 53 26 7 42 58 28 16 ...
 $ c_printsearchresults_gvResults.Sale.Date           : Factor w/ 1 level "12/26/2014": 1 1 1 1 1 1 1 1 1 1 ...
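
Since the question mentions that only readLines gets through the firewall, one possible variation is to download the print page with readLines and hand the text to the XML package for parsing only. The sketch below assumes the XML package is installed; tables_from_lines and the tiny sample page are hypothetical names and data made up for illustration, and the approach has not been checked against the live site.

```r
library(XML)

# Hypothetical helper: turn lines returned by readLines() into data frames.
tables_from_lines <- function(lines) {
  # Join the lines into one string and parse it as HTML text, not as a path.
  doc <- htmlParse(paste(lines, collapse = "\n"), asText = TRUE)
  # stringsAsFactors = FALSE keeps cells as character instead of factor.
  readHTMLTable(doc, stringsAsFactors = FALSE)
}

# Offline demonstration with a tiny made-up page.
sample_lines <- c("<html><body><table>",
                  "<tr><th>Defendant</th><th>Appraised</th></tr>",
                  "<tr><td>DOE/JOHN</td><td>$1,200.00</td></tr>",
                  "</table></body></html>")
tables <- tables_from_lines(sample_lines)
str(tables[[1]])

# Against the real print page it would be used roughly like this (untested):
# con <- url(print_url); lines <- readLines(con); close(con)
# sale_list <- tables_from_lines(lines)[[1]]
```

Here print_url stands for the printresults.aspx address used above. Parsing from text keeps all network access in readLines, which is the one call the question reports as working.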

If you want to clean up the data or split it into more columns, you can use regular expressions.
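
As a rough illustration of that cleanup, the base R sketch below collapses the embedded "\r\n" and padding seen in the str output and converts the dollar strings to numbers. The sample vectors are made up to resemble those cells ("13CV1234" is a hypothetical case number), and treating the text after the case number as a separate field is an assumption about what the column means.

```r
# Hypothetical sample values mimicking the whitespace-laden cells shown above.
case_number <- c("07CV4653\r\n                        PLURIESBANKRUPTCY",
                 "13CV1234")
appraised   <- c("$10,268.33", "$1,200.00")

# Collapse every run of whitespace (including \r\n) to one space, then trim.
squish <- function(x) trimws(gsub("\\s+", " ", as.character(x)))

# Strip "$" and "," so the amounts can be converted to numbers.
to_dollars <- function(x) as.numeric(gsub("[$,]", "", as.character(x)))

case_clean <- squish(case_number)

# Split the case number from any trailing text that follows it.
case_id    <- sub(" .*$", "", case_clean)
case_extra <- ifelse(grepl(" ", case_clean), sub("^\\S+ ", "", case_clean), NA)

cleaned <- data.frame(case_id, case_extra, appraised = to_dollars(appraised),
                      stringsAsFactors = FALSE)
print(cleaned)
```

The as.character calls matter because the columns in table1 are factors; applying as.numeric directly to a factor would return the internal level codes rather than the amounts.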
