
Using xpath in R to scrape data from website with multiple similar paths

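I am trying to scrape, in R, the list of apartments for sale and their basic information (address, m2, price, rooms, etc.) from this website: https://www.boligsiden.dk/tilsalg/ejerlejlighed?sortAscending=true&priceMin=3000000&priceMax=7000000 (see also the page screenshot and inspector view below).

Using SelectorGadget, I cannot create a path that uniquely extracts the square meters of all 50 apartments on page 1, nor another unique path that extracts the number of rooms, and so on.

I did manage to find a path that uniquely extracts the addresses (see the code block below). But that element sits in a different block/class from the rest of the text.

Here is my current code: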

library(rvest)
library(dplyr)

link = "https://www.boligsiden.dk/tilsalg/ejerlejlighed?sortAscending=true&priceMin=3000000&priceMax=7000000&page=1"
page = read_html(link)
address = page %>% html_nodes("div.mr-2") %>% html_text()
price = #MISSING - CAN'T FIGURE OUT
sqm = #MISSING - CAN'T FIGURE OUT
rooms = #MISSING - CAN'T FIGURE OUT
forsale = data.frame(address, price, sqm, rooms, stringsAsFactors = FALSE)

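Any ideas on how to approach this? I also tried using xpath to extract the sqm, but I only managed to extract one specific text field, not all 50 on the page.

Other approaches are welcome too. Thanks in advance!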

Using their API (found in the Network section of the browser's developer tools), you can call it and retrieve the information like this:

library(tidyverse)
library(httr2)

"https://api.prod.bs-aws-stage.com/search/cases?addressTypes=condo&priceMax=7000000&priceMin=3000000&per_page=100&page=1&sortAscending=true&sortBy=timeOnMarket" %>%
  request() %>%
  req_perform() %>%
  resp_body_json(simplifyVector = TRUE) %>%
  pluck("cases") %>%
  unnest(address, names_sep = "_") %>%
  mutate(
    address = str_c(address_roadName, address_houseNumber, address_zipCode, sep = " "),
    .before = 1
  ) %>%
  select(address,
         price = priceCash,
         sqm = housingArea,
         rooms = numberOfRooms)

# A tibble: 100 × 4
   address                       price   sqm rooms
   <chr>                         <int> <int> <int>
 1 Holsteinsgade 66 2100       3135000    56     2
 2 Tuborgvej 60 2900           4875000   114     4
 3 Poppellunden 8 4000         3350000    92     3
 4 Hyldegårds Tværvej 5 2920   6498000   115     3
 5 Grollowstræde 3 3000        3495000    92     3
 6 Rasmus Rasks Vej 8 2500     3995000    80     3
 7 Ryesgade 7 8000             4598000   110     4
 8 Carl Th. Zahles Gade 8 2300 5795000   113     3
 9 Strandlodsvej 23E 2300      5495000   101     3
10 Nordre Fasanvej 162 2000    4695000    90     4
# … with 90 more rows
# ℹ Use `print(n = ...)` to see more rows
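
The call above retrieves a single page (`page=1` with `per_page=100`). If more results are needed, one option is to loop over the `page` parameter. The following is only a sketch: the helper names `build_cases_url`, `fetch_cases_page` and `fetch_all_cases` are hypothetical, and it assumes (unverified) that the API returns an empty `cases` element once the pages run out.

```r
# Hypothetical helper: build the search URL for a given page.
# Endpoint and query parameters are copied from the call above.
build_cases_url <- function(page, per_page = 100) {
  params <- c(
    addressTypes  = "condo",
    priceMax      = "7000000",
    priceMin      = "3000000",
    per_page      = sprintf("%d", as.integer(per_page)),
    page          = sprintf("%d", as.integer(page)),
    sortAscending = "true",
    sortBy        = "timeOnMarket"
  )
  query <- paste(names(params), params, sep = "=", collapse = "&")
  paste0("https://api.prod.bs-aws-stage.com/search/cases?", query)
}

# Hypothetical helper: fetch one page and keep the four columns of interest.
# Returns NULL when the page has no cases (assumed end-of-results signal).
fetch_cases_page <- function(page) {
  body <- httr2::request(build_cases_url(page)) |>
    httr2::req_perform() |>
    httr2::resp_body_json(simplifyVector = TRUE)
  cases <- body$cases
  if (is.null(cases) || NROW(cases) == 0) return(NULL)
  data.frame(
    address = paste(cases$address$roadName,
                    cases$address$houseNumber,
                    cases$address$zipCode),
    price = cases$priceCash,
    sqm   = cases$housingArea,
    rooms = cases$numberOfRooms,
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: collect pages until one comes back empty
# or max_pages is reached, pausing briefly between requests.
fetch_all_cases <- function(max_pages = 5) {
  pages <- list()
  for (p in seq_len(max_pages)) {
    one <- fetch_cases_page(p)
    if (is.null(one)) break
    pages[[p]] <- one
    Sys.sleep(1)
  }
  do.call(rbind, pages)
}

# Usage (performs network requests):
# listings <- fetch_all_cases(max_pages = 3)
```

Selecting the four columns per page before binding avoids having to combine the nested data-frame columns that the full response contains.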

Which variables are available to extract:

"https://api.prod.bs-aws-stage.com/search/cases?addressTypes=condo&priceMax=7000000&priceMin=3000000&per_page=100&page=1&sortAscending=true&sortBy=timeOnMarket" %>%
  request() %>%
  req_perform() %>%
  resp_body_json(simplifyVector = TRUE) %>%
  pluck("cases") %>% 
  glimpse

Rows: 100
Columns: 37
$ `_links`               <df[,1]> <data.frame[30 x 1]>
$ address                <df[,28]> <data.frame[30 x 28]>
$ addressType            <chr> "condo", "condo", "condo", "condo", "condo", "condo", "c…
$ caseID                 <chr> "89194273-5948-4734-8085-fec9d42ac3c2", "ff6a9ff5-eacf-…
$ caseUrl                <chr> "https://www.lokalbolig.dk/?sag=26-X0001820", "https://www.…
$ coordinates            <df[,3]> <data.frame[30 x 3]>
$ daysOnMarket           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ defaultImage           <df[,1]> <data.frame[30 x 1]>
$ descriptionBody        <chr> "Lys stuelejlighed med to terrasser i HørsholmNær centrum o…
$ descriptionTitle       <chr> "Lys stuelejlighed med to terrasser i Hørsholm", "Fantas…
$ distinction            <chr> "real_estate", "real_estate", "real_estate", "real_estate",…
$ energyLabel            <chr> "c", "c", "d", "c", "d", "c", "c", "c", "c", "c", "c", "…
$ highlighted            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ housingArea            <int> 98, 82, 64, 91, 81, 97, 78, 113, 81, 91, 133, 69, 80, 64, 1…
$ images                 <list> [<data.frame[5 x 1]>], [<data.frame[3 x 1]>], [<data.frame[…
$ monthlyExpense         <int> 4183, 3888, 2798, 3205, 3557, 3405, 3233, 2688, 3921, 3907,…
$ nextOpenHouse          <df[,4]> <data.frame[30 x 4]>
$ numberOfFloors         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1,…
$ numberOfRooms          <int> 3, 3, 2, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 2, 4, 2, 3, 4, 2, 4,…
$ pageViews              <int> 126, 341, 191, 160, 358, 356, 242, 516, 133, 180, 134, 106…
$ perAreaPrice           <int> 40765, 54817, 62422, 71374, 43148, 58711, 60897, 41150, 480…
$ priceCash              <int> 3995000, 4495000, 3995000, 6495000, 3495000, 5695000, 47…
$ priceChangePercentage  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ providerCaseID         <chr> "26-X000182018025lok", "114-2102", "43000000643cam", "13433…
$ realEstate             <df[,3]> <data.frame[30 x 3]>
$ realtor                <df[,21]> <data.frame[30 x 21]>
$ slug                   <chr> "oerbaekgaards-alle-901-0-tv-2970-hoersholm-02239600_901_st…
$ status                 <chr> "open", "open", "open", "open", "open", "open", "open", "op…
$ timeOnMarket           <df[,2]> <data.frame[30 x 2]>
$ totalClickCount        <int> 103, 274, 109, 121, 227, 273, 205, 415, 82, 128, 122, 92, 1…
$ totalFavourites        <int> 1, 3, 0, 0, 4, 1, 1, 3, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2,…
$ utilitiesConnectionFee <df[,1]> <data.frame[30 x 1]>
$ yearBuilt              <int> 2002, 1886, 1907, 2008, 1932, 1914, 1900, 1926, 1934, 1932,…
$ basementArea           <int> NA, NA, NA, NA, NA, NA, NA, NA, 88, NA, NA, NA, NA, NA, …
$ lotArea                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 5327, NA, N…
$ weightedArea           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ secondaryAddressType   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

How to save the data to your environment:

df <- "https://api.prod.bs-aws-stage.com/search/cases?addressTypes=condo&priceMax=7000000&priceMin=3000000&per_page=100&page=1&sortAscending=true&sortBy=timeOnMarket" %>%
  request() %>%
  req_perform() %>%
  resp_body_json(simplifyVector = TRUE) %>%
  pluck("cases")

The selectors are a bit complex and fragile, but for now it seems to work:

library(rvest)
library(dplyr)
library(purrr)
library(stringr)

url <- "https://www.boligsiden.dk/tilsalg/ejerlejlighed?sortAscending=true&priceMin=3000000&priceMax=7000000"
html <- read_html(url)
html |> html_elements("div.shadow.overflow-hidden.mx-4") |>
  map_dfr(\(x)
    list( 
      "address" = html_element(x ,"div.mr-2") |> html_text2() |> str_squish(),
      "price"   = html_element(x ,"span.text-lg.pr-2") |> html_text(),
      "sqm"     = html_element(x ,"div.hidden.grid-cols-5.grid-rows-2 > div:nth-child(1) .text-sm" ) |> html_text(),
      "rooms"   = html_element(x ,"div.hidden.grid-cols-5.grid-rows-2 > div:nth-child(4) .text-sm" ) |> html_text()
      )
    )
#> # A tibble: 50 × 4
#>    address                                           price         sqm    rooms 
#>    <chr>                                             <chr>         <chr>  <chr> 
#>  1 Poppellunden 8, 4. tv. Himmelev, 4000 Roskilde    3.350.000 kr. 92 m²  3 Vær.
#>  2 Tuborgvej 60, 2. th. 2900 Hellerup                4.875.000 kr. 114 m² 4 Vær.
#>  3 Hyldegårds Tværvej 5, st. tv. 2920 Charlottenlund 6.498.000 kr. 115 m² 3 Vær.
#>  4 Grollowstræde 3 3000 Helsingør                    3.495.000 kr. 92 m²  3 Vær.
#>  5 Ryesgade 7, 2. tv. 8000 Aarhus C                  4.598.000 kr. 110 m² 4 Vær.
#>  6 Carl Th. Zahles Gade 8, 2. tv. 2300 København S   5.795.000 kr. 113 m² 3 Vær.
#>  7 Rasmus Rasks Vej 8, 2. tv. 2500 Valby             3.995.000 kr. 80 m²  3 Vær.
#>  8 Strandlodsvej 23E, 1. mf. 2300 København S        5.495.000 kr. 101 m² 3 Vær.
#>  9 Nordre Fasanvej 162, 3. th. 2000 Frederiksberg    4.695.000 kr. 90 m²  4 Vær.
#> 10 Ringstedgade 17B, 1. th. 4000 Roskilde            5.395.000 kr. 137 m² 5 Vær.
#> # … with 40 more rows
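
The scraped columns above are all character strings ("3.350.000 kr.", "92 m²", "3 Vær."). If numeric values are needed, a base R sketch along the following lines could convert them. The helper names `parse_price` and `leading_int` are hypothetical, and the sketch assumes the formats shown in the output: dots as thousands separators and a leading whole number for area and rooms.

```r
# Two rows copied from the output above. Non-ASCII characters are
# written as \u escapes so the snippet does not depend on file encoding.
scraped <- data.frame(
  price = c("3.350.000 kr.", "4.875.000 kr."),
  sqm   = c("92 m\u00b2", "114 m\u00b2"),
  rooms = c("3 V\u00e6r.", "4 V\u00e6r."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop everything except ASCII digits, so
# "3.350.000 kr." becomes 3350000. Assumes prices have no decimals.
parse_price <- function(x) as.integer(gsub("[^0-9]", "", x))

# Hypothetical helper: keep only the first run of digits, so the unit
# suffix is ignored even if it were written as "m2".
leading_int <- function(x) as.integer(sub("^[^0-9]*([0-9]+).*$", "\\1", x))

clean <- data.frame(
  price = parse_price(scraped$price),
  sqm   = leading_int(scraped$sqm),
  rooms = leading_int(scraped$rooms)
)
```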

Created on 2023-02-01 with reprex v2.0.2
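
Since the question asked about xpath specifically, the same per-listing idea can be written with relative XPath expressions. The sketch below runs on a small made-up HTML snippet rather than the live page: the class names `card`, `addr`, `price` and `facts` are invented for illustration and are not the site's real classes. Select each listing container first, then query inside it with paths that start with `.//`, so every listing yields exactly one value per field (or `NA` when the field is absent).

```r
library(rvest)

# Toy HTML standing in for the listing page. All class names are made up.
# The second card deliberately has no price element.
page <- minimal_html('
  <div class="card">
    <div class="addr">Street 1</div>
    <span class="price">3.350.000 kr.</span>
    <div class="facts"><div>92 m2</div><div>3 rooms</div></div>
  </div>
  <div class="card">
    <div class="addr">Street 2</div>
    <div class="facts"><div>114 m2</div><div>4 rooms</div></div>
  </div>
')

# Step 1: one node per listing.
cards <- html_elements(page, xpath = "//div[@class='card']")

# Step 2: relative XPath (leading ".") evaluated inside each card.
# html_element() returns one result per card, missing where nothing matches.
listings <- data.frame(
  address = html_text(html_element(cards, xpath = ".//div[@class='addr']"), trim = TRUE),
  price   = html_text(html_element(cards, xpath = ".//span[@class='price']"), trim = TRUE),
  sqm     = html_text(html_element(cards, xpath = ".//div[@class='facts']/div[1]"), trim = TRUE),
  rooms   = html_text(html_element(cards, xpath = ".//div[@class='facts']/div[2]"), trim = TRUE),
  stringsAsFactors = FALSE
)
```

On the real page the elements carry several classes each, so an exact `@class='...'` match would likely need to be replaced with something like `contains(@class, '...')`.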
