How do I scrape Tableau data from a website into R?
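I am working on a project that currently requires me to visit this website every day (https://returntogrounds.virginia.edu/covid-tracker) and manually add the date and UVA positive cases values for each new date to a data frame. Is there code I can run in R to build a data frame of date and UVA positive cases without adding the new data by hand every day? I found a similar question here, but it is for Python, which I am not familiar with.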
You will need the following Tableau URL:
https://public.tableau.com/views/UVACOVIDTracker/Summary?&:embed=y&:showVizHome=no
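From there, follow this flow (the same as in this post):
1. Call a URL of the following form:
   GET https://public.tableau.com/views/S07StuP58/Dashboard1?:embed=y&:showVizHome=no
2. Extract the JSON content from the textarea with id tsConfigContainer.
3. Build the data URL using the session ID:
   POST https://public.tableau.com/{vizql_path}/bootstrapSession/sessions/{session_id}
4. Extract the JSON data from the response, which is not valid JSON as returned (use a regex to split it).
5. Extract the data from the large JSON configuration. This is not straightforward, because all the string data sits in a single array. You need the data indices from the individual fields to split the data into columns and then build your data frame.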
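Steps 3 and 4 above can be tried offline in Python. This is a minimal sketch, not Tableau's documented API: the sample config and response strings are invented for illustration, and the helper names `build_session_url` and `split_bootstrap_response` are hypothetical.

```python
import json
import re


def build_session_url(host_url, config):
    """Build the bootstrapSession URL from the tsConfigContainer JSON."""
    return f'{host_url}{config["vizql_root"]}/bootstrapSession/sessions/{config["sessionid"]}'


def split_bootstrap_response(text):
    """Split a '<digits>;{json}<digits>;{json}' response into two parsed objects."""
    match = re.search(r"\d+;(\{.*\})\d+;(\{.*\})", text, re.DOTALL)
    if match is None:
        raise ValueError("response does not contain two prefixed JSON chunks")
    return json.loads(match.group(1)), json.loads(match.group(2))


# Invented sample values, only to exercise the helpers without a network call.
sample_config = {
    "vizql_root": "/vizql/w/Example/v/Sheet",
    "sessionid": "ABC123",
    "sheetId": "Sheet",
}
sample_response = '11;{"info": 1}27;{"secondaryInfo": {"x": 2}}'

session_url = build_session_url("https://public.tableau.com", sample_config)
info, data = split_bootstrap_response(sample_response)
print(session_url)
print(info, data)
```

Raising an error when the pattern does not match makes a changed response format fail loudly instead of producing an empty data frame.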
This view contains many worksheets, so I wrote a script that prompts the user to pick one, letting you check which one suits you best:
library(rvest)
library(rjson)
library(httr)
library(stringr)

# Replace the hostname and the path if necessary
host_url <- "https://public.tableau.com"
path <- "/views/UVACOVIDTracker/Summary"

# Load the embedded view and read the config JSON from the textarea
body <- read_html(modify_url(host_url,
  path = path,
  query = list(":embed" = "y", ":showVizHome" = "no")
))
data <- body %>%
  html_nodes("textarea#tsConfigContainer") %>%
  html_text()
json <- fromJSON(data)

# Build the bootstrapSession URL from the session ID and request the data
url <- modify_url(host_url, path = paste(json$vizql_root, "/bootstrapSession/sessions/", json$sessionid, sep = ""))
resp <- POST(url, body = list(sheet_id = json$sheetId), encode = "form")
data <- content(resp, "text")

# The response holds two JSON documents, each prefixed by "<digits>;"
extract <- str_match(data, "\\d+;(\\{.*\\})\\d+;(\\{.*\\})")
info <- fromJSON(extract[1, 2])  # first capture group (column 1 is the full match)
data <- fromJSON(extract[1, 3])  # second capture group

# List the worksheets and let the user pick one
worksheets <- names(data$secondaryInfo$presModelMap$vizData$presModelHolder$genPresModelMapPresModel$presModelMap)
for (i in seq_along(worksheets)) {
  print(paste("[", i, "] ", worksheets[i], sep = ""))
}
selected <- readline(prompt = "select worksheet by index: ")
worksheet <- worksheets[as.integer(selected)]
print(paste("you selected :", worksheet, sep = " "))

# For each captioned column, collect its value and alias indices
columnsData <- data$secondaryInfo$presModelMap$vizData$presModelHolder$genPresModelMapPresModel$presModelMap[[worksheet]]$presModelHolder$genVizDataPresModel$paneColumnsData
result <- list()
for (t in columnsData$vizDataColumns) {
  if (!is.null(t[["fieldCaption"]])) {
    # Use the first pane/column index when several are listed
    paneIndex <- t$paneIndices[1]
    columnIndex <- t$columnIndices[1]
    paneColumn <- columnsData$paneColumnsList[[paneIndex + 1]]$vizPaneColumns[[columnIndex + 1]]
    result[[length(result) + 1]] <- list(
      fieldCaption = t[["fieldCaption"]],
      valueIndices = paneColumn$valueIndices,
      aliasIndices = paneColumn$aliasIndices,
      dataType = t[["dataType"]]
    )
  }
}

# The data dictionary holds one flat array of values per data type
dataFull <- data$secondaryInfo$presModelMap$dataDictionary$presModelHolder$genDataDictionaryPresModel$dataSegments[["0"]]$dataColumns

# Negative alias indices point into the cstring array
cstring <- list()
for (t in dataFull) {
  if (t$dataType == "cstring") {
    cstring <- t
    break
  }
}

# Resolve the indices of every column against the array of matching data type
frameData <- list()
frameNames <- character(0)
for (t in dataFull) {
  for (index in result) {
    if (identical(t$dataType, index$dataType)) {
      if (length(index$valueIndices) > 0) {
        values <- character(length(index$valueIndices))
        j <- 1
        for (it in index$valueIndices) {
          values[j] <- t$dataValues[it + 1]  # indices are 0-based
          j <- j + 1
        }
        frameData[[length(frameData) + 1]] <- values
        frameNames[length(frameNames) + 1] <- paste(index$fieldCaption, "value", sep = "-")
      }
      if (length(index$aliasIndices) > 0) {
        values <- character(length(index$aliasIndices))
        j <- 1
        for (it in index$aliasIndices) {
          if (it >= 0) {
            values[j] <- t$dataValues[it + 1]
          } else {
            values[j] <- cstring$dataValues[abs(it)]
          }
          j <- j + 1
        }
        frameData[[length(frameData) + 1]] <- values
        frameNames[length(frameNames) + 1] <- paste(index$fieldCaption, "alias", sep = "-")
      }
    }
  }
}

# Pad shorter columns with "" so all columns have the same length
maxLen <- max(lengths(frameData))
df <- list()
for (i in seq_along(frameNames)) {
  len <- length(frameData[[i]])
  if (len < maxLen) {
    frameData[[i]][(len + 1):maxLen] <- ""
  }
  df[[frameNames[i]]] <- frameData[[i]]
}
options(width = 1200)
df <- as.data.frame(df, stringsAsFactors = FALSE)
print(df)
Unlike in this post, the dataType field has to be matched against the fields from presModelHolder$genVizDataPresModel$paneColumnsData (which describe all the indices in each column).
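The matching described here can be illustrated with toy data. This is a simplified sketch: the dictionaries below are invented and much flatter than the real Tableau payload, and `resolve_columns` is a hypothetical helper, but the lookup rule (pick the value array with the same dataType, then follow the indices) is the one the scripts apply.

```python
# Toy data dictionary: one flat array of values per data type (invented values).
data_columns = [
    {"dataType": "integer", "dataValues": [29, 33, 45]},
    {
        "dataType": "datetime",
        "dataValues": [
            "2020-10-01 00:00:00",
            "2020-09-30 00:00:00",
            "2020-09-29 00:00:00",
        ],
    },
]

# Toy column descriptions, shaped like the entries built from paneColumnsData.
columns = [
    {"fieldCaption": "Cases", "dataType": "integer", "valueIndices": [2, 0, 1]},
    {"fieldCaption": "Event Date", "dataType": "datetime", "valueIndices": [2, 0, 1]},
]


def resolve_columns(columns, data_columns):
    """Look up each column's values in the array whose dataType matches."""
    by_type = {col["dataType"]: col["dataValues"] for col in data_columns}
    return {
        c["fieldCaption"]: [by_type[c["dataType"]][i] for i in c["valueIndices"]]
        for c in columns
    }


table = resolve_columns(columns, data_columns)
print(table)
```

Without the dataType check, the same indices would be applied to every array in the dictionary, which mixes dates and counts in one column.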
Output of this script:
Loading required package: xml2
[1] "[1] Active inpatient"
[1] "[2] Employee tests 2 weeks ago"
[1] "[3] Employee tests last week"
[1] "[4] Hosp all line"
[1] "[5] Hosp yesterday"
[1] "[6] Pos all UVA count line"
[1] "[7] Pos all UVA total"
[1] "[8] Pos student count line"
[1] "[9] Pos student total"
[1] "[10] Resources"
[1] "[11] Room isolation bar"
[1] "[12] Room quarantine bar"
[1] "[13] Student cases yesterday"
[1] "[14] Student new case 10-day total"
[1] "[15] Student test last week"
[1] "[16] Student tests 2 weeks ago"
[1] "[17] Tests UVA Lab TAT"
[1] "[18] Title"
[1] "[19] UVA 2 weeks ago"
[1] "[20] UVA Cases 10 subtotal"
[1] "[21] UVA Cases yesterday"
[1] "[22] UVA tests - last week"
[1] "[23] avg cases - 2 wks ago"
[1] "[24] avg cases - 3 wks ago"
[1] "[25] avg cases - last wk"
[1] "[26] avg new cases - this week"
[1] "[27] avg student cases - 2 weeks ago"
[1] "[28] avg student cases - 3 weeks ago"
[1] "[29] avg student cases - last week"
[1] "[30] avg student cases - this week"
select worksheet by index: 6
[1] "you selected : Pos all UVA count line"
   X.Calculation_246290626693455872..value X.Event_Date..value
1                                       29 2020-10-01 00:00:00
2                                       33 2020-09-30 00:00:00
3                                       45 2020-09-29 00:00:00
4                                        4 2020-09-28 00:00:00
5                                       17 2020-09-27 00:00:00
6                                       23 2020-09-26 00:00:00
7                                       41 2020-09-25 00:00:00
..............................................................
40                                       2 2020-08-23 00:00:00
41                                       5 2020-08-22 00:00:00
42                                       3 2020-08-21 00:00:00
43                                       5 2020-08-20 00:00:00
44                                       3 2020-08-19 00:00:00
45                                       4 2020-08-18 00:00:00
46                                       4 2020-08-17 00:00:00
I noticed that the worksheets you can use are "Pos all UVA count line" and "Pos student count line".
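Since the goal is a table of date and UVA positive cases, the two raw columns still need to be typed and ordered. Below is a small stdlib sketch using the first three rows of the output shown earlier; the field names `date` and `uva_positive_cases` and the helper `tidy_rows` are illustrative choices, not part of the original scripts.

```python
from datetime import datetime

# First three rows of the printed output, as (count, timestamp) strings.
raw_rows = [
    ("29", "2020-10-01 00:00:00"),
    ("33", "2020-09-30 00:00:00"),
    ("45", "2020-09-29 00:00:00"),
]


def tidy_rows(rows):
    """Convert string pairs to typed records sorted by date, oldest first."""
    records = [
        {
            "date": datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").date(),
            "uva_positive_cases": int(count),
        }
        for count, ts in rows
    ]
    return sorted(records, key=lambda record: record["date"])


tidy = tidy_rows(raw_rows)
for record in tidy:
    print(record["date"].isoformat(), record["uva_positive_cases"])
```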
The same script written in Python:
import requests
from bs4 import BeautifulSoup
import json
import re
import pandas as pd

# Replace the hostname and the path if necessary
host_url = "https://public.tableau.com"
path = "/views/UVACOVIDTracker/Summary"
url = f"{host_url}{path}"

# Load the embedded view and read the config JSON from the textarea
r = requests.get(
    url,
    params={
        ":embed": "y",
        ":showVizHome": "no"
    }
)
soup = BeautifulSoup(r.text, "html.parser")
tableauData = json.loads(soup.find("textarea", {"id": "tsConfigContainer"}).text)

# Build the bootstrapSession URL from the session ID and request the data
dataUrl = f'{host_url}{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'
r = requests.post(dataUrl, data={
    "sheet_id": tableauData["sheetId"],
})

# The response holds two JSON documents, each prefixed by "<digits>;"
dataReg = re.search(r'\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))

# List the worksheets and let the user pick one
worksheets = list(data["secondaryInfo"]["presModelMap"]["vizData"]["presModelHolder"]["genPresModelMapPresModel"]["presModelMap"].keys())
for idx, ws in enumerate(worksheets):
    print(f"[{idx}] {ws}")
selected = input("select worksheet by index: ")
worksheet = worksheets[int(selected)]
print(f"you selected : {worksheet}")

# For each captioned column, collect its value and alias indices
columnsData = data["secondaryInfo"]["presModelMap"]["vizData"]["presModelHolder"]["genPresModelMapPresModel"]["presModelMap"][worksheet]["presModelHolder"]["genVizDataPresModel"]["paneColumnsData"]
result = [
    {
        "fieldCaption": t.get("fieldCaption", ""),
        "valueIndices": columnsData["paneColumnsList"][t["paneIndices"][0]]["vizPaneColumns"][t["columnIndices"][0]]["valueIndices"],
        "aliasIndices": columnsData["paneColumnsList"][t["paneIndices"][0]]["vizPaneColumns"][t["columnIndices"][0]]["aliasIndices"],
        "dataType": t.get("dataType"),
        "paneIndices": t["paneIndices"][0],
        "columnIndices": t["columnIndices"][0]
    }
    for t in columnsData["vizDataColumns"]
    if t.get("fieldCaption")
]

# The data dictionary holds one flat array of values per data type
dataFull = data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"]


def onAlias(it, value, cstring):
    # Negative alias indices point into the cstring array (1-based)
    return value[it] if (it >= 0) else cstring["dataValues"][abs(it)-1]


frameData = {}
cstring = [t for t in dataFull if t["dataType"] == "cstring"][0]

# Resolve the indices of every column against the array of matching data type
for t in dataFull:
    for index in result:
        if (t["dataType"] == index["dataType"]):
            if len(index["valueIndices"]) > 0:
                frameData[f'{index["fieldCaption"]}-value'] = [t["dataValues"][abs(it)] for it in index["valueIndices"]]
            if len(index["aliasIndices"]) > 0:
                frameData[f'{index["fieldCaption"]}-alias'] = [onAlias(it, t["dataValues"], cstring) for it in index["aliasIndices"]]

# Columns may differ in length: build row-wise, fill the gaps with 0, then transpose
df = pd.DataFrame.from_dict(frameData, orient='index').fillna(0).T
with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.width', 1000):
    print(df)
Edit: I have improved the scripts to include the alias values, which provide more data.
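How the scripts treat alias indices can be shown in isolation. In this sketch the arrays are invented toy values, and the convention (a non-negative index reads the column's own array, a negative index is a 1-based position in the cstring array) is taken from the scripts above rather than from any Tableau documentation.

```python
def resolve_alias(index, values, cstring_values):
    """Non-negative: position in the column's own values; negative: 1-based position in cstring."""
    return values[index] if index >= 0 else cstring_values[abs(index) - 1]


# Invented toy arrays.
real_values = [1.5, 2.5]
cstring_values = ["%null%", "N/A"]
alias_indices = [0, -1, 1, -2]

resolved = [resolve_alias(i, real_values, cstring_values) for i in alias_indices]
print(resolved)
```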
I have put both the Python and R scripts in a repo here.
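Look into rvest/xml2 for scraping parseable HTML. Unfortunately, this is not straightforward for Tableau/PowerBI applications. For pages like these, where the objects are built on the client, it is better to access the underlying data.
The other answer you highlighted is on the right track: get the data in JSON format (usually from an API request) and extract the values you want. However, another problem you will run into is that the session ID is not persistent. You may need to capture all the XHR objects when you visit the page's URL and then work through some messy logic to identify the right one.
(If you need to see all the resources requested during a page visit, press F12 in your browser and go to the Network tab.)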
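Because the session ID changes between visits, a script has to read it from the freshly loaded page on every run instead of hard-coding it. The stdlib sketch below shows that extraction on two invented HTML snippets standing in for two separate visits; the class `ConfigTextareaParser` and the function `extract_config` are hypothetical names, and real pages will carry a much larger config.

```python
import json
from html.parser import HTMLParser


class ConfigTextareaParser(HTMLParser):
    """Collect the text inside <textarea id="tsConfigContainer">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "textarea" and dict(attrs).get("id") == "tsConfigContainer":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "textarea":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_config(html):
    """Return the parsed config JSON embedded in the page."""
    parser = ConfigTextareaParser()
    parser.feed(html)
    parser.close()
    return json.loads("".join(parser.chunks))


# Invented pages standing in for two visits to the same view.
visit_1 = '<html><body><textarea id="tsConfigContainer">{"sessionid": "AAA111"}</textarea></body></html>'
visit_2 = '<html><body><textarea id="tsConfigContainer">{"sessionid": "BBB222"}</textarea></body></html>'

session_ids = [extract_config(page)["sessionid"] for page in (visit_1, visit_2)]
print(session_ids)
```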
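At this stage, it may not hurt to ask the Tableau author whether an API is publicly available, or whether they could add a dataset download option to the report.
Good luck.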