How can I scrape information from this website in R?

Question

This website http://a810-bisweb.nyc.gov/bisweb/bispi00.jsp is for searching NYC building application information. Under the "Application Searches" section there is a "BIS Job Number:" field, and the information I want to extract is on the new page that loads after I enter a job number and click "Go".

For example, from the dataset https://data.cityofnewyork.us/Housing-Development/DOB-Job-Application-Filings/ic3t-wcy2 , I pick job number 220286232, go to the first website, enter the number in "BIS Job Number:" and click "Go". Now I get a new page. The information I want is the "Applicant of Record Information" (including the applicant's contact information).

I'm stuck here. How can I extract this applicant information for each job number?

I am very new to web scraping. I have learned how to extract information from a single page using rvest, but I'm not familiar with scraping across different websites.

Thank you.

Update: I tried to use the Socrata API, but I found that the applicant contact information doesn't have its own API fields. If there is no API field for that information (while other information on the page does have fields), does it mean I can't use the API to solve this problem?

Thank you!

Answer 1

On that page, click the "API" tab at the top right. A modal dialog titled "Access this Dataset via SODA API" will pop up; copy the link, in this case https://data.cityofnewyork.us/resource/rvhx-8trz.json . This is a URL that provides the data directly in machine-readable JSON format, but only 1000 records are fetched at a time.

So you may need to add appropriate $offset parameters. See the Socrata documentation. The City of New York seems to use this software for its Open Data platform.

You could call them this way in your R script:

https://data.cityofnewyork.us/resource/rvhx-8trz.json?$offset=0
https://data.cityofnewyork.us/resource/rvhx-8trz.json?$offset=1000
https://data.cityofnewyork.us/resource/rvhx-8trz.json?$offset=2000
https://data.cityofnewyork.us/resource/rvhx-8trz.json?$offset=...

(untested for higher offsets)
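
Paged URLs like the ones above could be generated in R instead of typed by hand. The following is only a minimal sketch: the helper name `soda_page_urls` is made up for illustration, and it assumes each page holds 1000 records, so consecutive offsets step by 1000. `$limit` and `$offset` are the paging parameters described in the Socrata documentation.

```r
# Hypothetical helper (not part of any package): build one URL per page.
soda_page_urls <- function(base_url, n_pages, page_size = 1000) {
  # Offsets 0, page_size, 2 * page_size, ...
  offsets <- (seq_len(n_pages) - 1) * page_size
  # scientific = FALSE avoids offsets being rendered like "1e+05"
  paste0(
    base_url,
    "?$limit=", format(page_size, scientific = FALSE),
    "&$offset=", format(offsets, scientific = FALSE, trim = TRUE)
  )
}

urls <- soda_page_urls(
  "https://data.cityofnewyork.us/resource/rvhx-8trz.json",
  n_pages = 3
)
print(urls)
```

How many pages are needed depends on the size of the dataset, which this sketch does not try to determine.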

Use jsonlite to convert the JSON into R data frames.
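
As a rough sketch of that conversion step: `jsonlite::fromJSON()` accepts either a JSON string or a URL, and it returns a data frame when the JSON is an array of records. The sample JSON below is made up so the snippet runs offline; its field names and values are placeholders, not the dataset's real columns. The commented-out part shows how the same call might be pointed at the live endpoint, which is an untested assumption here.

```r
library(jsonlite)

# Made-up sample standing in for one page of the API response;
# the field names and values are placeholders, not real data.
sample_json <- '[{"job": "000000001", "borough": "BRONX"},
                 {"job": "000000002", "borough": "QUEENS"}]'

# An array of JSON objects is simplified into a data frame
page <- fromJSON(sample_json)
str(page)

# Against the live endpoint (untested here, requires network access):
# base_url <- "https://data.cityofnewyork.us/resource/rvhx-8trz.json"
# pages <- lapply(c(0, 1000), function(off) {
#   fromJSON(paste0(base_url, "?$offset=", off))
# })
# all_rows <- rbind_pages(pages)  # combine the per-page data frames
```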

Answer 1
0 2017-09-03 08:23:16