How to scrape a JavaScript table in R?

I want to scrape a table from the Citi Bike trip data page: https://s3.amazonaws.com/tripdata/index.html

My goal is to get the URLs of all the zip files at once, instead of manually typing every date and downloading the files one at a time. Since the webpage is updated monthly, I want to be able to get all the up-to-date data files every time I run the function.

I first tried the rvest and XML packages, and then realized that the webpage contains both static HTML and a table that is generated by a JavaScript function. That is where the problem lies.

I would really appreciate any help. Please let me know if I can provide further information.

If I go to https://s3.amazonaws.com/tripdata/ (just the root, without index.html), I get a simple XML file. If you want to parse the XML, the relevant element is Key (uppercase K, lowercase e and y), but I would just search the plain text. That is: ignore the XML structure, treat the response as a simple text file, take every string between <Key> and </Key> as a filename, and prefix it with https://s3.amazonaws.com/tripdata/ to get the download URL.
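The plain-text approach described above could be sketched in base R roughly as follows. This is only an illustration under some assumptions: the function names extract_keys and build_urls are made up for this example, the sample listing and its file names are invented stand-ins for the real bucket contents, and filtering on a .zip suffix is a guess at what the question needs.

```r
# Pull every string between <Key> and </Key> out of the listing,
# treating the XML as plain text rather than parsing it.
extract_keys <- function(listing_text) {
  matches <- regmatches(listing_text,
                        gregexpr("<Key>[^<]*</Key>", listing_text))[[1]]
  gsub("</?Key>", "", matches)  # strip the surrounding tags
}

# Keep only the zip files and prefix each key with the bucket URL.
build_urls <- function(keys, base_url = "https://s3.amazonaws.com/tripdata/") {
  zip_keys <- keys[grepl("\\.zip$", keys)]
  if (length(zip_keys) == 0) {
    return(character(0))  # paste0() would otherwise return the bare base_url
  }
  paste0(base_url, zip_keys)
}

# Invented sample input; the real listing has more fields per entry.
sample_listing <- paste0(
  "<ListBucketResult>",
  "<Contents><Key>example-2020-01.zip</Key></Contents>",
  "<Contents><Key>example-2020-02.zip</Key></Contents>",
  "<Contents><Key>index.html</Key></Contents>",
  "</ListBucketResult>"
)

zip_urls <- build_urls(extract_keys(sample_listing))
print(zip_urls)
```

To run it against the live listing, something like `listing_text <- paste(readLines("https://s3.amazonaws.com/tripdata/", warn = FALSE), collapse = "")` should supply the input, and each resulting URL can then be passed to `download.file()`.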

The first entry appears to contain everything in a single file (170 MB), so that one alone may be enough for you.
