How do I scrape Tableau data from a website into R?

I'm working on a project that currently requires me to visit this website ( https://returntogrounds.virginia.edu/covid-tracker ) every day and manually add each new day's date and UVA positive cases value to a data frame. Is there code I can run in R that would create a data frame of date and UVA positive cases, so that I don't have to add the new data by hand every day? I see that there is a similar question here, but it is for Python, which I am unfamiliar with.

You will need to get the Tableau URL, which is:

https://public.tableau.com/views/UVACOVIDTracker/Summary?&:embed=y&:showVizHome=no

From there, you need to execute the following flow (same as in this post):

  • call the following URL:

     GET https://public.tableau.com/views/UVACOVIDTracker/Summary?:embed=y&:showVizHome=no
  • extract the JSON content from the textarea with id tsConfigContainer

  • build the URL with the session ID (the sessionid field of that JSON)

     POST https://public.tableau.com/{vizql_path}/bootstrapSession/sessions/{session_id}
  • extract the JSON data from the response, which is not valid JSON as a whole (use a regex to split it)

  • extract the data from the large JSON configuration. This is not straightforward, since all the string data are located in a single array. You need to get the data indices from various fields in order to split the data into columns and then build your data frame
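The response-splitting step can be tried offline before touching the network. The following is only a minimal Python sketch: `sample_response` and the function name `split_bootstrap_response` are made up for illustration, and it assumes the response has the shape `<number>;{json}<number>;{json}` that the regex in the scripts below relies on.

```python
import json
import re

# Synthetic stand-in for the bootstrapSession response body: two JSON
# documents, each preceded by "<number>;". The numbers and keys are
# invented for illustration; a real response is far larger.
sample_response = '13;{"sheet": 1}28;{"secondaryInfo": {"x": 2}}'


def split_bootstrap_response(text):
    """Return the two JSON documents embedded in the response text."""
    # Greedy groups: the first group ends at the last "}" that is directly
    # followed by "<digits>;{", the second group runs to the final "}".
    match = re.search(r"\d+;(\{.*\})\d+;(\{.*\})", text, re.DOTALL)
    if match is None:
        raise ValueError("response does not look like '<n>;{...}<n>;{...}'")
    return json.loads(match.group(1)), json.loads(match.group(2))


info, data = split_bootstrap_response(sample_response)
print(info)
print(data)
```

The numeric prefixes are simply skipped here; whether they could be used to slice the response more robustly is not something this sketch assumes.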

There are many "worksheets" in this view, so I've written a script that prompts the user to select one, so you can check which one is most convenient for you:

library(rvest)
library(rjson)
library(httr)
library(stringr)

#replace the hostname and the path if necessary
host_url <- "https://public.tableau.com"
path <- "/views/UVACOVIDTracker/Summary"

body <- read_html(modify_url(host_url, 
                             path = path, 
                             query = list(":embed" = "y",":showVizHome" = "no")
))

data <- body %>% 
  html_nodes("textarea#tsConfigContainer") %>% 
  html_text()
json <- fromJSON(data)

url <- modify_url(host_url, path = paste(json$vizql_root, "/bootstrapSession/sessions/", json$sessionid, sep =""))

resp <- POST(url, body = list(sheet_id = json$sheetId), encode = "form")
data <- content(resp, "text")

# str_match returns the full match in column 1 and the two capture groups in columns 2 and 3
extract <- str_match(data, "\\d+;(\\{.*\\})\\d+;(\\{.*\\})")
info <- fromJSON(extract[1,2])
data <- fromJSON(extract[1,3])

worksheets = names(data$secondaryInfo$presModelMap$vizData$presModelHolder$genPresModelMapPresModel$presModelMap)

for(i in 1:length(worksheets)){
  print(paste("[",i,"] ",worksheets[i], sep=""))
}
selected <-  readline(prompt="select worksheet by index: ");
worksheet <- worksheets[as.integer(selected)]
print(paste("you selected :", worksheet, sep=" "))

columnsData <- data$secondaryInfo$presModelMap$vizData$presModelHolder$genPresModelMapPresModel$presModelMap[[worksheet]]$presModelHolder$genVizDataPresModel$paneColumnsData

i <- 1
result <- list();
for(t in columnsData$vizDataColumns){
  if (is.null(t[["fieldCaption"]]) == FALSE) {
    paneIndex <- t$paneIndices
    columnIndex <- t$columnIndices
    if (length(t$paneIndices) > 1){
      paneIndex <- t$paneIndices[1]
    }
    if (length(t$columnIndices) > 1){
      columnIndex <- t$columnIndices[1]
    }
    result[[i]] <- list(
      fieldCaption = t[["fieldCaption"]], 
      valueIndices = columnsData$paneColumnsList[[paneIndex + 1]]$vizPaneColumns[[columnIndex + 1]]$valueIndices,
      aliasIndices = columnsData$paneColumnsList[[paneIndex + 1]]$vizPaneColumns[[columnIndex + 1]]$aliasIndices, 
      dataType = t[["dataType"]],
      stringsAsFactors = FALSE
    )
    i <- i + 1
  }
}
dataFull = data$secondaryInfo$presModelMap$dataDictionary$presModelHolder$genDataDictionaryPresModel$dataSegments[["0"]]$dataColumns

cstring <- list();
for(t in dataFull) {
  if(t$dataType == "cstring"){
    cstring <- t
    break
  }
}
data_index <- 1
name_index <- 1
frameData <-  list()
frameNames <- c()
for(t in dataFull) {
  for(index in result) {
    if (t$dataType == index$dataType){
      if (length(index$valueIndices) > 0) {
        j <- 1
        vector <- character(length(index$valueIndices))
        for (it in index$valueIndices){
          vector[j] <- t$dataValues[it+1]
          j <- j + 1
        }
        frameData[[data_index]] <- vector
        frameNames[[name_index]] <- paste(index$fieldCaption, "value", sep="-")
        data_index <- data_index + 1
        name_index <- name_index + 1
      }
      if (length(index$aliasIndices) > 0) {
        j <- 1
        vector <- character(length(index$aliasIndices))
        for (it in index$aliasIndices){
          if (it >= 0){
            vector[j] <- t$dataValues[it+1]
          } else {
            vector[j] <- cstring$dataValues[abs(it)]
          }
          j <- j + 1
        }
        frameData[[data_index]] <- vector
        frameNames[[name_index]] <- paste(index$fieldCaption, "alias", sep="-")
        data_index <- data_index + 1
        name_index <- name_index + 1
      }
    }
  }
}

df <- NULL
lengthList <- c()
for(i in 1:length(frameNames)){
  lengthList[i] <- length(frameData[[i]])
}
max <- max(lengthList)
for(i in 1:length(frameNames)){
  if (length(frameData[[i]]) < max){
    len <- length(frameData[[i]])
    frameData[[i]][(len+1):max]<-""
  }
  df[frameNames[[i]]] <- frameData[i]
}
options(width = 1200)
df <- as.data.frame(df, stringsAsFactors = FALSE)
print(df)

Contrary to this post, the dataType field needs to be the same as the dataType of the field in presModelHolder$genVizDataPresModel$paneColumnsData (which describes all the indices in each column).

Output of this script:

Loading required package: xml2
[1] "[1] Active inpatient"
[1] "[2] Employee tests 2 weeks ago"
[1] "[3] Employee tests last week"
[1] "[4] Hosp all line"
[1] "[5] Hosp yesterday"
[1] "[6] Pos all UVA count line"
[1] "[7] Pos all UVA total"
[1] "[8] Pos student count line"
[1] "[9] Pos student total"
[1] "[10] Resources"
[1] "[11] Room isolation bar"
[1] "[12] Room quarantine bar"
[1] "[13] Student cases yesterday"
[1] "[14] Student new case 10-day total"
[1] "[15] Student test last week"
[1] "[16] Student tests 2 weeks ago"
[1] "[17] Tests UVA Lab TAT"
[1] "[18] Title"
[1] "[19] UVA 2 weeks ago"
[1] "[20] UVA Cases 10 subtotal"
[1] "[21] UVA Cases yesterday"
[1] "[22] UVA tests - last week"
[1] "[23] avg cases - 2 wks ago"
[1] "[24] avg cases - 3 wks ago"
[1] "[25] avg cases - last wk"
[1] "[26] avg new cases - this week"
[1] "[27] avg student cases - 2 weeks ago"
[1] "[28] avg student cases - 3 weeks ago"
[1] "[29] avg student cases - last week"
[1] "[30] avg student cases - this week"
select worksheet by index: 6
[1] "you selected : Pos all UVA count line"
   X.Calculation_246290626693455872..value X.Event_Date..value
1                                       29 2020-10-01 00:00:00
2                                       33 2020-09-30 00:00:00
3                                       45 2020-09-29 00:00:00
4                                        4 2020-09-28 00:00:00
5                                       17 2020-09-27 00:00:00
6                                       23 2020-09-26 00:00:00
7                                       41 2020-09-25 00:00:00
..............................................................
40                                       2 2020-08-23 00:00:00
41                                       5 2020-08-22 00:00:00
42                                       3 2020-08-21 00:00:00
43                                       5 2020-08-20 00:00:00
44                                       3 2020-08-19 00:00:00
45                                       4 2020-08-18 00:00:00
46                                       4 2020-08-17 00:00:00

I've noticed that the worksheets that would work are "Pos all UVA count line" and "Pos student count line".
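Since the goal is a table of date and UVA positive cases, the two scraped columns still need a little tidying. Here is a rough Python sketch of that step using only the standard library. It is not part of the scripts in this answer: the `tidy` helper and the `uva_positive_cases` column name are hypothetical, and the input values are the first four rows of the output shown above.

```python
import csv
import io
from datetime import datetime

# First rows of the "Pos all UVA count line" output shown above,
# as the strings the scraper returns.
cases = ["29", "33", "45", "4"]
dates = [
    "2020-10-01 00:00:00",
    "2020-09-30 00:00:00",
    "2020-09-29 00:00:00",
    "2020-09-28 00:00:00",
]


def tidy(dates, cases):
    """Pair each date with its count, parse both, and sort oldest first."""
    rows = [
        (datetime.strptime(d, "%Y-%m-%d %H:%M:%S").date(), int(c))
        for d, c in zip(dates, cases)
    ]
    return sorted(rows)


rows = tidy(dates, cases)

# Write to an in-memory buffer; swap in open("cases.csv", "w", newline="")
# to keep a file on disk (the file name is only an example).
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["date", "uva_positive_cases"])
writer.writerows((d.isoformat(), n) for d, n in rows)
print(buffer.getvalue())
```

Because the whole series is scraped on every run, the table can simply be rebuilt each day instead of having one row appended.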

The same scraping script written in Python:

import requests
from bs4 import BeautifulSoup
import json
import re
import pandas as pd

#replace the hostname and the path if necessary
host_url = "https://public.tableau.com"
path = "/views/UVACOVIDTracker/Summary"

url = f"{host_url}{path}"

r = requests.get(
    url,
    params= {
        ":embed": "y",
        ":showVizHome": "no"
    }
) 
soup = BeautifulSoup(r.text, "html.parser")

tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)

dataUrl = f'{host_url}{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'

r = requests.post(dataUrl, data= {
    "sheet_id": tableauData["sheetId"],
})

dataReg = re.search(r'\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))

worksheets = list(data["secondaryInfo"]["presModelMap"]["vizData"]["presModelHolder"]["genPresModelMapPresModel"]["presModelMap"].keys())

for idx, ws in enumerate(worksheets):
    print(f"[{idx}] {ws}")

selected = input("select worksheet by index: ")
worksheet = worksheets[int(selected)]
print(f"you selected : {worksheet}")

columnsData = data["secondaryInfo"]["presModelMap"]["vizData"]["presModelHolder"]["genPresModelMapPresModel"]["presModelMap"][worksheet]["presModelHolder"]["genVizDataPresModel"]["paneColumnsData"]
result = [ 
    {
        "fieldCaption": t.get("fieldCaption", ""), 
        "valueIndices": columnsData["paneColumnsList"][t["paneIndices"][0]]["vizPaneColumns"][t["columnIndices"][0]]["valueIndices"],
        "aliasIndices": columnsData["paneColumnsList"][t["paneIndices"][0]]["vizPaneColumns"][t["columnIndices"][0]]["aliasIndices"],
        "dataType": t.get("dataType"),
        "paneIndices": t["paneIndices"][0],
        "columnIndices": t["columnIndices"][0]
    }
    for t in columnsData["vizDataColumns"]
    if t.get("fieldCaption")
]
dataFull = data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"]

def onAlias(it, value, cstring):
    return value[it] if (it >= 0) else cstring["dataValues"][abs(it)-1]

frameData = {}
cstring = [t for t in dataFull if t["dataType"] == "cstring"][0]
for t in dataFull:
    for index in result:
        if (t["dataType"] == index["dataType"]):
            if len(index["valueIndices"]) > 0:
                frameData[f'{index["fieldCaption"]}-value'] = [t["dataValues"][abs(it)] for it in index["valueIndices"]]
            if len(index["aliasIndices"]) > 0:
                frameData[f'{index["fieldCaption"]}-alias'] = [onAlias(it, t["dataValues"], cstring) for it in index["aliasIndices"]]

df = pd.DataFrame.from_dict(frameData, orient='index').fillna(0).T
with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.width', 1000):
    print(df)

Try this on repl.it

Edit: I've improved the scripts to include the alias values, which gives more data.

I've made a repo including the Python and R scripts here.
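To make the index lookup concrete, here is a toy Python sketch of how value and alias indices are resolved against the data dictionary. The data is invented and the `resolve` helper is hypothetical. It only mirrors the rules used in the scripts above: a column reads from the array whose dataType matches its own, and a negative alias index points into the cstring array at position abs(index) - 1.

```python
# Toy data dictionary: one array of values per data type, shaped like dataColumns.
data_columns = [
    {"dataType": "integer", "dataValues": [29, 33, 45]},
    {"dataType": "cstring", "dataValues": ["low", "high"]},
]

# Toy column description, shaped like one entry of the `result` list.
column = {
    "fieldCaption": "Cases",
    "dataType": "integer",
    "valueIndices": [2, 0, 1],
    "aliasIndices": [0, -1, -2],
}


def resolve(column, data_columns):
    """Return (values, aliases) for one column of the worksheet."""
    by_type = {c["dataType"]: c["dataValues"] for c in data_columns}
    typed = by_type[column["dataType"]]   # array matching the column's dataType
    strings = by_type.get("cstring", [])  # shared string array
    values = [typed[i] for i in column["valueIndices"]]
    aliases = [
        typed[i] if i >= 0 else strings[abs(i) - 1]
        for i in column["aliasIndices"]
    ]
    return values, aliases


values, aliases = resolve(column, data_columns)
print(values)   # [45, 29, 33]
print(aliases)  # [29, 'low', 'high']
```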

Look up rvest/xml2 for scraping parseable HTML. Unfortunately, with Tableau/PowerBI applications, this is not straightforward. With pages such as this one with built objects, accessing the underlying data is preferable.

The other answer you highlight is on the right track. Get the JSON-formatted data (usually from an API request) and extract the values you want. However, another problem you will find is that the session ID is not persistent. You may need to capture all the XHR objects when you visit the page's URL and then go through some messy logic to identify the right object.

(If you need to view all the resources accessed during the page visit, press F12 in your browser and go to the 'Network' tab.)
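One possible way to cope with the non-persistent session ID, sketched here under assumptions, is to rebuild the request URL on every run from the configuration JSON embedded in the page, rather than hard-coding it. The key names `vizql_root` and `sessionid` are the ones the other answer reads from the tsConfigContainer textarea. The `bootstrap_url` helper and the two example configs, including their path and session values, are made up.

```python
HOST = "https://public.tableau.com"


def bootstrap_url(ts_config, host=HOST):
    """Build the bootstrapSession URL from a freshly parsed page config."""
    return (
        f'{host}{ts_config["vizql_root"]}'
        f'/bootstrapSession/sessions/{ts_config["sessionid"]}'
    )


# Two invented configs, standing in for the JSON read on two separate visits.
first_visit = {"vizql_root": "/vizql/w/UVACOVIDTracker/v/Summary", "sessionid": "AAAA-0:0"}
second_visit = {"vizql_root": "/vizql/w/UVACOVIDTracker/v/Summary", "sessionid": "BBBB-0:0"}

print(bootstrap_url(first_visit))
print(bootstrap_url(second_visit))
```

Each visit yields a different URL, so the page has to be fetched again before every POST.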

At this stage, it probably wouldn't hurt to ask the Tableau authors whether the API is publicly available, or whether they can offer a dataset download capability in the report.

Good luck.
