
Import multiple CSV files via HTTPS into R

I am trying to import multiple CSV files into R over HTTPS (from Google Drive Sheets).

Here is what I did to import a single CSV file using RCurl (this works):

#Load packages
require(RCurl)
require(plyr)

x <- getURL("https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdDFLWXZXb08wMVIzY3JrX2tNU2dROEE&output=csv")
x <- read.csv(textConnection(x), header = TRUE, stringsAsFactors = FALSE, skip=1)

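I then created a data frame named "hashtags" that holds the names and URLs of the 12 CSV files, so that I can import all of them. Here are the first six rows of hashtags: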

> head(hashtags)
name             url
1 #capstoneisfun https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdDFLWXZXb08wMVIzY3JrX2tNU2dROEE&output=csv
2 #CEP810        https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdFlQS2FPNzJsdS1TMVBuTHlQTS1FRnc&output=csv
3 #CEP811        https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdDhLcEI1a0U1T0I0Zm5RaU5UVWdmdlE&output=csv
4 #CEP812        https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdDJzMjZhN2pGa29QYU5weVhZdjRKdmc&output=csv
5 #CEP813        https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdGpJa0VMTmJNdzZ4UjBvUEx5cWsycEE&output=csv
6 #CEP815        https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdFB2R0czWjJ2SU9HQWR5VUVuODk3R0E&output=csv

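What I want to do is import all of the files as data frames. I know that an apply function or a for loop could do this, but both are beyond my current abilities.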

This is a great place to use the curl package, which provides "a drop-in replacement for url()" that works with https:

library(curl)

urls <- c(
  "https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdDFLWXZXb08wMVIzY3JrX2tNU2dROEE&output=csv",
  "https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdFlQS2FPNzJsdS1TMVBuTHlQTS1FRnc&output=csv"
)

cons <- lapply(urls, curl)
lapply(cons, read.csv, stringsAsFactors = FALSE, skip = 1)
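The list returned above is unnamed, so it is not obvious which data frame belongs to which hashtag. One possible way to keep the names attached is sketched below. `read_sheets` and its `open_con` parameter are hypothetical names introduced here for illustration and are not part of the curl package. The sketch assumes every sheet has one extra line above its header, as in the question.

```r
# Hypothetical helper (not part of curl): read every URL through a
# connection opener and name each result after its hashtag.
# Assumption: each sheet has one line to skip above the header row.
read_sheets <- function(urls, sheet_names, open_con = curl::curl, skip = 1) {
  dfs <- lapply(urls, function(u) {
    # read.csv opens the connection itself and closes it when done
    read.csv(open_con(u), stringsAsFactors = FALSE, skip = skip)
  })
  setNames(dfs, as.character(sheet_names))
}

# Possible usage with the data frame from the question (needs network access):
# sheets <- read_sheets(hashtags$url, hashtags$name)
# head(sheets[["#CEP810"]])
```

Passing the opener as an argument is only a convenience: it lets the same function be tried on local files with `open_con = file` before pointing it at the real URLs.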

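Here is one that uses httr (which improves on RCurl and has a better time on Windows) together with data.table's rbindlist, so you get a single data.table holding all of the tweets and hashtags in one object rather than having to walk through a list. dplyr is used only because it is what I use every day now; the %>% calls can easily be removed and replaced with base operations.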

library(httr)
library(dplyr)

hashtags <- read.table(text="hashtag,url
#capstoneisfun,https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdDFLWXZXb08wMVIzY3JrX2tNU2dROEE&output=csv
#CEP810,https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdFlQS2FPNzJsdS1TMVBuTHlQTS1FRnc&output=csv
#CEP811,https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdDhLcEI1a0U1T0I0Zm5RaU5UVWdmdlE&output=csv
#CEP812,https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdDJzMjZhN2pGa29QYU5weVhZdjRKdmc&output=csv
#CEP813,https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdGpJa0VMTmJNdzZ4UjBvUEx5cWsycEE&output=csv
#CEP815,https://docs.google.com/spreadsheet/pub?key=0AsDUegPJ1ngvdFB2R0czWjJ2SU9HQWR5VUVuODk3R0E&output=csv", 
                       stringsAsFactors=FALSE, header=TRUE, sep=",", comment.char="")

tweets <- data.table::rbindlist(by(hashtags, hashtags$hashtag, function(x) {
  doc <- GET(x$url)
  dat <- read.csv(textConnection(content(doc, as="text")), header=TRUE, stringsAsFactors=FALSE, sep=",", skip=1)
  dat <- dat %>% mutate(hashtag=x$hashtag)
  dat  
}))

nrow(tweets)
## [1] 1618

glimpse(tweets)

## Variables:
## $ Date         (chr) "12/12/2014 21:51:49", "11/19/2014 10:17:39", "11/16/2014 4:2...
## $ Twitter.User (chr) "https://twitter.com/matthewkoehler/status/543440594446868481...
## $ Followers    (int) 946, 895, 399, 12, 153, 881, 216, 865, 395, 12, 82, 857, 393,...
## $ Follows      (int) 994, 907, 1174, 24, 114, 887, 492, 869, 1148, 24, 201, 855, 1...
## $ Retweets     (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 2, 1, 1, 1, 0, 0, 0, 0, 0, 0...
## $ Favorites    (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ Tweet.Text   (chr) "#capstoneisfun Awesome TA of the Week is @spgreenhalgh ! htt...
## $ hashtag      (chr) "#capstoneisfun", "#capstoneisfun", "#capstoneisfun", "#capst...

tweets$hashtag %>% unique

## [1] "#capstoneisfun" "#CEP810"        "#CEP811"        "#CEP812"       
## [5] "#CEP813"        "#CEP815"       
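As noted above, the dplyr and data.table steps can be swapped for base operations. A minimal base-R sketch of the combining step follows. `bind_with_hashtag` is a hypothetical helper name, and it assumes you already have a named list of data frames that all share the same columns, with the list names being the hashtags.

```r
# Hypothetical helper: tag each data frame with its list name and stack
# them into a single data frame using base R only.
# Assumption: all data frames in df_list have identical columns.
bind_with_hashtag <- function(df_list) {
  tagged <- Map(function(df, tag) {
    df$hashtag <- rep(tag, nrow(df))  # one hashtag value per row
    df
  }, df_list, names(df_list))
  combined <- do.call(rbind, unname(tagged))
  rownames(combined) <- NULL          # plain 1..n row names
  combined
}

# Small offline illustration with made-up rows:
demo <- list(
  "#a" = data.frame(Retweets = c(0L, 3L), stringsAsFactors = FALSE),
  "#b" = data.frame(Retweets = 1L, stringsAsFactors = FALSE)
)
combined_demo <- bind_with_hashtag(demo)
```

For many large sheets rbindlist is likely to be faster; the base version mainly avoids the extra dependencies.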

Maybe:

dfList <- list()
for (i in seq_len(nrow(hashtags))) {
  x <- getURL(hashtags[i, "url"])
  dfList[[hashtags[i, 1]]] <- read.csv(textConnection(x), header = TRUE,
                                       stringsAsFactors = FALSE, skip = 1)
}

This seems to be successful (although I don't think loading pkg::plyr is needed, and the code was tested without doing so). The top of the output of str(dfList):

str(dfList)
List of 6
 $ #capstoneisfun:'data.frame': 63 obs. of  7 variables:
  ..$ Date        : chr [1:63] "12/12/2014 21:51:49" "11/19/2014 10:17:39" "11/16/2014 4:29:39" "11/14/2014 5:44:57" ...
  ..$ Twitter.User: chr [1:63] "https://twitter.com/matthewkoehler/status/543440594446868481" "https://twitter.com/matthewkoehler/status/534930982802321408" "https://twitter.com/spgreenhalgh/status/533756240837771265" "https://twitter.com/sarahfkeenan/status/533050416087715840" ...
  ..$ Followers   : int [1:63] 946 895 399 12 153 881 216 865 395 12 ...
  ..$ Follows     : int [1:63] 994 907 1174 24 114 887 492 869 1148 24 ...
  ..$ Retweets    : int [1:63] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ Favorites   : int [1:63] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ Tweet.Text  : chr [1:63] "#capstoneisfun Awesome TA of the Week is @spgreenhalgh ! http://t.co/fbKqtHAhcl" "Module 12 is beginning! #capstoneisfun" "Had a fantastic time with #capstoneisfun students today in exhibitions! So fun to see everyone's portfolios as they're finishin"| __truncated__ "@emstrazz, your intended audience can 
 # snipped rest
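The question also mentions apply functions. The same loop can be written with lapply, and wrapping each read in tryCatch keeps one unreachable sheet from aborting the whole import. This is only a sketch under stated assumptions: `import_all` and its `fetch` parameter are hypothetical names, `fetch` defaults to RCurl's getURL as used above, and `hashtags` is assumed to have the `name` and `url` columns shown in the question.

```r
# Hypothetical helper: lapply version of the loop with basic error handling.
# Assumptions: hashtags has columns `name` and `url`; fetch(u) returns the
# CSV as a single string; each sheet has one line to skip above the header.
import_all <- function(hashtags, fetch = RCurl::getURL, skip = 1) {
  dfs <- lapply(hashtags$url, function(u) {
    tryCatch(
      read.csv(text = fetch(u), header = TRUE,
               stringsAsFactors = FALSE, skip = skip),
      error = function(e) {
        # Report the failure and leave a NULL placeholder for this sheet
        warning("Could not import ", u, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  names(dfs) <- as.character(hashtags$name)
  dfs
}

# Possible usage (needs network access):
# dfList <- import_all(hashtags)
# failed <- names(dfList)[vapply(dfList, is.null, logical(1))]
```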

