Using xpath in R to scrape data from website with multiple similar paths
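I am trying to scrape, in R, the list of apartments for sale and their basic information (address, m2, price, rooms, etc.) from this website: https://www.boligsiden.dk/tilsalg/ejerlejlighed?sortAscending=true&priceMin=3000000&priceMax=7000000 (see also the page screenshot + inspection below)
Using SelectorGadget, I cannot create one unique path that extracts the square meters of all 50 apartments on page 1, and another unique path that extracts the number of rooms, and so on.
I did manage to find a unique path for the address (see the code block below), but it sits in a different block/class from the rest of the text.
This is my current code: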
library(rvest)
library(dplyr)
link = "https://www.boligsiden.dk/tilsalg/ejerlejlighed?sortAscending=true&priceMin=3000000&priceMax=7000000&page=1"
page = read_html(link)
address = page %>% html_nodes("div.mr-2") %>% html_text()
price = #MISSING - CAN'T FIGURE OUT
sqm = #MISSING - CAN'T FIGURE OUT
rooms = #MISSING - CAN'T FIGURE OUT
forsale = data.frame(address, price, sqm, rooms, stringsAsFactors = FALSE)
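Any ideas on how to approach this? I also tried using xpath to extract sqm, but I only managed to extract one specific text field rather than all 50 on the page.
Other approaches are welcome too. Thanks in advance!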
Using their API (found in the Network tab of the browser's developer tools), you can call it directly and retrieve the information like this:
library(tidyverse)
library(httr2)
"https://api.prod.bs-aws-stage.com/search/cases?addressTypes=condo&priceMax=7000000&priceMin=3000000&per_page=100&page=1&sortAscending=true&sortBy=timeOnMarket" %>%
  request() %>%
  req_perform() %>%
  resp_body_json(simplifyVector = TRUE) %>%
  pluck("cases") %>%
  unnest(address, names_sep = "_") %>%
  mutate(
    address = str_c(address_roadName, address_houseNumber, address_zipCode, sep = " "),
    .before = 1
  ) %>%
  select(address,
         price = priceCash,
         sqm = housingArea,
         rooms = numberOfRooms)
# A tibble: 100 × 4
address price sqm rooms
<chr> <int> <int> <int>
1 Holsteinsgade 66 2100 3135000 56 2
2 Tuborgvej 60 2900 4875000 114 4
3 Poppellunden 8 4000 3350000 92 3
4 Hyldegårds Tværvej 5 2920 6498000 115 3
5 Grollowstræde 3 3000 3495000 92 3
6 Rasmus Rasks Vej 8 2500 3995000 80 3
7 Ryesgade 7 8000 4598000 110 4
8 Carl Th. Zahles Gade 8 2300 5795000 113 3
9 Strandlodsvej 23E 2300 5495000 101 3
10 Nordre Fasanvej 162 2000 4695000 90 4
# … with 90 more rows
# ℹ Use `print(n = ...)` to see more rows
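As a side note, the query string can also be assembled from named parameters with httr2's req_url_query(), which makes it easier to ask for other pages. This is only a minimal sketch: it assumes the endpoint and parameter names are exactly those in the URL above, and build_cases_request() is a hypothetical helper name, not part of httr2 or of the API. It only builds the request and does not contact the server.

```r
library(httr2)

# Hypothetical helper: builds (but does not send) the request for one page.
# Endpoint and parameter names are copied from the URL above; the constant
# values are passed as strings so they are written out exactly as given.
build_cases_request <- function(page = 1, per_page = 100) {
  request("https://api.prod.bs-aws-stage.com/search/cases") |>
    req_url_query(
      addressTypes  = "condo",
      priceMin      = "3000000",
      priceMax      = "7000000",
      per_page      = as.integer(per_page),
      page          = as.integer(page),
      sortAscending = "true",
      sortBy        = "timeOnMarket"
    )
}

req <- build_cases_request(page = 2)
req$url  # inspect the URL that would be requested
```

Piping req into req_perform() and resp_body_json(simplifyVector = TRUE), as in the code above, should then give the cases for that page. I have not checked how the API responds once the page number runs past the last result.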
The variables available for extraction:
"https://api.prod.bs-aws-stage.com/search/cases?addressTypes=condo&priceMax=7000000&priceMin=3000000&per_page=100&page=1&sortAscending=true&sortBy=timeOnMarket" %>%
  request() %>%
  req_perform() %>%
  resp_body_json(simplifyVector = TRUE) %>%
  pluck("cases") %>%
  glimpse()
Rows: 100
Columns: 37
$ `_links` <df[,1]> <data.frame[30 x 1]>
$ address <df[,28]> <data.frame[30 x 28]>
$ addressType <chr> "condo", "condo", "condo", "condo", "condo", "condo", "c…
$ caseID <chr> "89194273-5948-4734-8085-fec9d42ac3c2", "ff6a9ff5-eacf-…
$ caseUrl <chr> "https://www.lokalbolig.dk/?sag=26-X0001820", "https://www.…
$ coordinates <df[,3]> <data.frame[30 x 3]>
$ daysOnMarket <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ defaultImage <df[,1]> <data.frame[30 x 1]>
$ descriptionBody <chr> "Lys stuelejlighed med to terrasser i HørsholmNær centrum o…
$ descriptionTitle <chr> "Lys stuelejlighed med to terrasser i Hørsholm", "Fantas…
$ distinction <chr> "real_estate", "real_estate", "real_estate", "real_estate",…
$ energyLabel <chr> "c", "c", "d", "c", "d", "c", "c", "c", "c", "c", "c", "…
$ highlighted <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ housingArea <int> 98, 82, 64, 91, 81, 97, 78, 113, 81, 91, 133, 69, 80, 64, 1…
$ images <list> [<data.frame[5 x 1]>], [<data.frame[3 x 1]>], [<data.frame[…
$ monthlyExpense <int> 4183, 3888, 2798, 3205, 3557, 3405, 3233, 2688, 3921, 3907,…
$ nextOpenHouse <df[,4]> <data.frame[30 x 4]>
$ numberOfFloors <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1,…
$ numberOfRooms <int> 3, 3, 2, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 2, 4, 2, 3, 4, 2, 4,…
$ pageViews <int> 126, 341, 191, 160, 358, 356, 242, 516, 133, 180, 134, 106…
$ perAreaPrice <int> 40765, 54817, 62422, 71374, 43148, 58711, 60897, 41150, 480…
$ priceCash <int> 3995000, 4495000, 3995000, 6495000, 3495000, 5695000, 47…
$ priceChangePercentage <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ providerCaseID <chr> "26-X000182018025lok", "114-2102", "43000000643cam", "13433…
$ realEstate <df[,3]> <data.frame[30 x 3]>
$ realtor <df[,21]> <data.frame[30 x 21]>
$ slug <chr> "oerbaekgaards-alle-901-0-tv-2970-hoersholm-02239600_901_st…
$ status <chr> "open", "open", "open", "open", "open", "open", "open", "op…
$ timeOnMarket <df[,2]> <data.frame[30 x 2]>
$ totalClickCount <int> 103, 274, 109, 121, 227, 273, 205, 415, 82, 128, 122, 92, 1…
$ totalFavourites <int> 1, 3, 0, 0, 4, 1, 1, 3, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2,…
$ utilitiesConnectionFee <df[,1]> <data.frame[30 x 1]>
$ yearBuilt <int> 2002, 1886, 1907, 2008, 1932, 1914, 1900, 1926, 1934, 1932,…
$ basementArea <int> NA, NA, NA, NA, NA, NA, NA, NA, 88, NA, NA, NA, NA, NA, …
$ lotArea <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 5327, NA, N…
$ weightedArea <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ secondaryAddressType <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
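The columns shown as <df[,n]> above (address, realtor, coordinates, ...) are data frames nested inside the main data frame, which is why the first example needs unnest(). They can also be reached directly with a second $. The sketch below uses a small hand-built stand-in rather than the real API response; the toy object `cases` and its two rows are assumptions for illustration, with values taken from the first two rows of the output shown earlier.

```r
# Toy stand-in for the API result: a data frame with a nested data-frame
# column, mimicking the `address` column in the glimpse() output above.
cases <- data.frame(priceCash = c(3135000L, 4875000L))
cases$address <- data.frame(
  roadName    = c("Holsteinsgade", "Tuborgvej"),
  houseNumber = c("66", "60"),
  stringsAsFactors = FALSE
)

# A nested column is reached with a second `$`
cases$address$roadName

# Build a one-line street address without unnesting first
street <- paste(cases$address$roadName, cases$address$houseNumber)
street
```

With the real response the same pattern would apply to whichever fields the nested address data frame actually contains.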
How to save the data to your environment:
df <- "https://api.prod.bs-aws-stage.com/search/cases?addressTypes=condo&priceMax=7000000&priceMin=3000000&per_page=100&page=1&sortAscending=true&sortBy=timeOnMarket" %>%
  request() %>%
  req_perform() %>%
  resp_body_json(simplifyVector = TRUE) %>%
  pluck("cases")
The selectors are a bit complex and fragile, but for now this seems to work:
library(rvest)
library(dplyr)
library(purrr)
library(stringr)
url <- "https://www.boligsiden.dk/tilsalg/ejerlejlighed?sortAscending=true&priceMin=3000000&priceMax=7000000"
html <- read_html(url)
html |>
  html_elements("div.shadow.overflow-hidden.mx-4") |>
  map_dfr(\(x)
    list(
      "address" = html_element(x, "div.mr-2") |> html_text2() |> str_squish(),
      "price" = html_element(x, "span.text-lg.pr-2") |> html_text(),
      "sqm" = html_element(x, "div.hidden.grid-cols-5.grid-rows-2 > div:nth-child(1) .text-sm") |> html_text(),
      "rooms" = html_element(x, "div.hidden.grid-cols-5.grid-rows-2 > div:nth-child(4) .text-sm") |> html_text()
    )
  )
#> # A tibble: 50 × 4
#> address price sqm rooms
#> <chr> <chr> <chr> <chr>
#> 1 Poppellunden 8, 4. tv. Himmelev, 4000 Roskilde 3.350.000 kr. 92 m² 3 Vær.
#> 2 Tuborgvej 60, 2. th. 2900 Hellerup 4.875.000 kr. 114 m² 4 Vær.
#> 3 Hyldegårds Tværvej 5, st. tv. 2920 Charlottenlund 6.498.000 kr. 115 m² 3 Vær.
#> 4 Grollowstræde 3 3000 Helsingør 3.495.000 kr. 92 m² 3 Vær.
#> 5 Ryesgade 7, 2. tv. 8000 Aarhus C 4.598.000 kr. 110 m² 4 Vær.
#> 6 Carl Th. Zahles Gade 8, 2. tv. 2300 København S 5.795.000 kr. 113 m² 3 Vær.
#> 7 Rasmus Rasks Vej 8, 2. tv. 2500 Valby 3.995.000 kr. 80 m² 3 Vær.
#> 8 Strandlodsvej 23E, 1. mf. 2300 København S 5.495.000 kr. 101 m² 3 Vær.
#> 9 Nordre Fasanvej 162, 3. th. 2000 Frederiksberg 4.695.000 kr. 90 m² 4 Vær.
#> 10 Ringstedgade 17B, 1. th. 4000 Roskilde 5.395.000 kr. 137 m² 5 Vær.
#> # … with 40 more rows
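Note that price, sqm and rooms come back as character strings here ("3.350.000 kr.", "92 m²", "3 Vær."). If numbers are needed, one option is to strip everything that is not a digit. A minimal sketch on a few values copied from the output above; to_int() is a hypothetical helper name, and it assumes the fields only ever hold whole numbers (a decimal such as "92,5 m²" would be mangled).

```r
# Sample values copied from the scraped output above
price <- c("3.350.000 kr.", "4.875.000 kr.")
sqm   <- c("92 m²", "114 m²")
rooms <- c("3 Vær.", "4 Vær.")

# Hypothetical helper: remove every character that is not an ASCII digit
# (thousands separators, units, the superscript 2), then convert to integer.
# Assumes whole numbers only.
to_int <- function(x) as.integer(gsub("[^0-9]", "", x))

cleaned <- data.frame(
  price = to_int(price),
  sqm   = to_int(sqm),
  rooms = to_int(rooms)
)
cleaned
```

The same helper could be applied to the scraped tibble with something like mutate(across(c(price, sqm, rooms), to_int)).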
Created on 2023-02-01 with reprex v2.0.2