
How can I web scrape information from this website in R?

This website, http://a810-bisweb.nyc.gov/bisweb/bispi00.jsp , is for searching NYC building application information. Under the "Application Searches" section there is a "BIS Job Number:" field, and the information I want to extract is on the new page that appears after I enter a job number and click "Go".

For example, from the dataset https://data.cityofnewyork.us/Housing-Development/DOB-Job-Application-Filings/ic3t-wcy2 , I pick job number 220286232, go to the first website, enter the number under "BIS Job Number:", and click "Go". This opens a new page. The information I want is the "Applicant of Record Information" (including the applicant's contact information).

I'm stuck here. How can I extract this applicant information for each job number?

I am very new to web scraping. I have learned how to extract information from a whole page using rvest, but I'm not familiar with web scraping across different websites.

Thank you.

Update: I tried to use the Socrata API, but I found that the applicant contact information doesn't have its own API fields. If there is no API field for this information (while other information on that page does have fields), does that mean I can't use the API to solve this problem?

Thank you!

On that page, at the top right, click on the "API" tab. A modal dialog titled "Access this Dataset via SODA API" will pop up; copy the link, in this case https://data.cityofnewyork.us/resource/rvhx-8trz.json . This is a URL which provides the data directly in machine-readable JSON format. However, only 1000 records will be fetched at a time.

So you may need to add appropriate $offset parameters. See the Socrata documentation. The City of New York seems to use this software for its Open Data platform.

Maybe call them this way in your R script:

https://data.cityofnewyork.us/resource/rvhx-8trz.json?$offset=0
https://data.cityofnewyork.us/resource/rvhx-8trz.json?$offset=1000
https://data.cityofnewyork.us/resource/rvhx-8trz.json?$offset=2000
https://data.cityofnewyork.us/resource/rvhx-8trz.json?$offset=...

(untested for higher offsets)
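The list of URLs could also be generated in R rather than typed by hand. The following is only a sketch: `build_page_urls` is a made-up helper name, the page size of 1000 is taken from the limit mentioned above, and `$limit` is the SODA parameter that pairs with `$offset`. It is written out explicitly here so that the step between offsets always matches the page size.

```r
# Hypothetical helper (not part of any package): build one URL per page
# for a SODA endpoint, so that consecutive pages do not overlap.
build_page_urls <- function(base_url, n_pages, page_size = 1000) {
  # Offsets 0, page_size, 2 * page_size, ...
  offsets <- (seq_len(n_pages) - 1) * page_size
  # sprintf() is vectorised over `offsets`; as.integer() avoids
  # scientific notation such as "1e+05" for large offsets.
  sprintf("%s?$limit=%d&$offset=%d",
          base_url, as.integer(page_size), as.integer(offsets))
}

urls <- build_page_urls(
  "https://data.cityofnewyork.us/resource/rvhx-8trz.json",
  n_pages = 3
)
print(urls)
```

How many pages are needed depends on the size of the dataset, which this sketch does not try to determine.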

Use jsonlite to convert the JSON into R data frames.
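Putting the paging and jsonlite together might look roughly like the sketch below. It is an illustration under assumptions, not a tested solution for this dataset: `fetch_all_pages` and its arguments are hypothetical names, the `fetch` argument exists only so the download step can be swapped out, and `max_pages` is an arbitrary safety cap. `jsonlite::fromJSON()` and `jsonlite::rbind_pages()` are the actual jsonlite functions being used.

```r
library(jsonlite)

# Hypothetical helper: download successive pages from a SODA endpoint
# and combine them into a single data frame.
fetch_all_pages <- function(base_url, page_size = 1000, max_pages = 5,
                            fetch = jsonlite::fromJSON) {
  pages <- list()
  for (i in seq_len(max_pages)) {
    offset <- (i - 1) * page_size
    url <- sprintf("%s?$limit=%d&$offset=%d",
                   base_url, as.integer(page_size), as.integer(offset))
    page <- fetch(url)  # fromJSON() accepts a URL and returns a data frame
    # An empty JSON array means there are no more records.
    if (NROW(page) == 0) break
    pages[[length(pages) + 1]] <- page
    # A short page means this was the last one.
    if (NROW(page) < page_size) break
  }
  if (length(pages) == 0) return(data.frame())
  # rbind_pages() fills columns that are missing in some pages with NA.
  jsonlite::rbind_pages(pages)
}
```

A call such as `fetch_all_pages("https://data.cityofnewyork.us/resource/rvhx-8trz.json", max_pages = 3)` would then request at most three pages. Raise `max_pages` only after checking that the first few pages look as expected.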
