[英]Parse HTML data using R
我有一個html數據集,如下所示,我想解析該數據並將其轉換為可以使用的表格格式。
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<div class="brewery" id="brewery">
<ul class="vcard simple">
<li class="name"> Bradley Farm / RB Brew, LLC</li>
<li class="address">317 Springtown Rd </li>
<li class="address_2">New Paltz, NY 12561-3020 | <a href='http://www.google.com/maps/place/317 Springtown Rd++New Paltz+NY+United States' target='_blank'>Map</a> </li>
<li class="telephone">Phone: (845) 255-8769</li>
<li class="brewery_type">Type: Micro</li>
<li class="url"><a href="http://www.raybradleyfarm.com" target="_blank">www.raybradleyfarm.com</a> </li>
</ul>
<ul class="vcard simple col2"></ul>
</div>
<div class="brewery">
<ul class="vcard simple">
<li class="name">(405) Brewing Co</li>
<li class="address">1716 Topeka St </li>
<li class="address_2">Norman, OK 73069-8224 | <a href='http://www.google.com/maps/place/1716 Topeka St++Norman+OK+United States' target='_blank'>Map</a> </li>
<li class="telephone">Phone: (405) 816-0490</li>
<li class="brewery_type">Type: Micro</li>
<li class="url"><a href="http://www.405brewing.com" target="_blank">www.405brewing.com</a> </li>
</ul>
<ul class="vcard simple col2"></ul>
</div>
</body>
下面是我使用的代碼。 我面臨的問題是它使用Rvest轉換為文本文件,但似乎無法使其具有任何有用的格式。
library(dplyr)
library(rvest)
url<-html("beer.html")
selector_name<-".brewery"
fnames<-html_nodes(x = url, css = selector_name) %>%
html_text()
head(fnames)
fnames
這是正確的方法還是我應該使用其他軟件包遍歷每個div和內部元素來實現?
我想看的是
No. Name Address Type Website
謝謝。
問題在於它不是一個表,因此解析起來並不容易。 這只是兩個列表,下面的代碼將它們合並為一個列表。 另外,請嘗試查看xml2
包以解析html / xml。
library(dplyr)
library(rvest)
library(xml2)
vcard <-
'<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<div class="brewery" id="brewery">
<ul class="vcard simple">
<li class="name"> Bradley Farm / RB Brew, LLC</li>
<li class="address">317 Springtown Rd </li>
<li class="address_2">New Paltz, NY 12561-3020 | <a href=\'http://www.google.com/maps/place/317 Springtown Rd++New Paltz+NY+United States\' target=\'_blank\'>Map</a> </li>
<li class="telephone">Phone: (845) 255-8769</li>
<li class="brewery_type">Type: Micro</li>
<li class="url"><a href="http://www.raybradleyfarm.com" target="_blank">www.raybradleyfarm.com</a> </li>
</ul>
<ul class="vcard simple col2"></ul>
</div>
<div class="brewery">
<ul class="vcard simple">
<li class="name">(405) Brewing Co</li>
<li class="address">1716 Topeka St </li>
<li class="address_2">Norman, OK 73069-8224 | <a href=\'http://www.google.com/maps/place/1716 Topeka St++Norman+OK+United States\' target=\'_blank\'>Map</a> </li>
<li class="telephone">Phone: (405) 816-0490</li>
<li class="brewery_type">Type: Micro</li>
<li class="url"><a href="http://www.405brewing.com" target="_blank">www.405brewing.com</a> </li>
</ul>
<ul class="vcard simple col2"></ul>
</div>
</body>' %>%
read_html(html) %>%
xml_find_all("//ul[@class = 'vcard simple']")
two_children <- sapply(vcard, function(x) xml2::xml_children(x))
data.frame(
class = sapply(two_children, function(x) xml2::xml_attrs(x)),
value = sapply(two_children, function(x) xml2::xml_text(x)),
stringsAsFactors = FALSE
)
library(rvest)
library(dplyr)
html_file <- '<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<div class="brewery" id="brewery">
<ul class="vcard simple">
<li class="name"> Bradley Farm / RB Brew, LLC</li>
<li class="address">317 Springtown Rd </li>
<li class="address_2">New Paltz, NY 12561-3020 | <a href="http://www.google.com/maps/place/317 Springtown Rd++New Paltz+NY+United States" target="_blank">Map</a> </li>
<li class="telephone">Phone: (845) 255-8769</li>
<li class="brewery_type">Type: Micro</li>
<li class="url"><a href="http://www.raybradleyfarm.com" target="_blank">www.raybradleyfarm.com</a> </li>
</ul>
<ul class="vcard simple col2"></ul>
</div>
<div class="brewery">
<ul class="vcard simple">
<li class="name">(405) Brewing Co</li>
<li class="address">1716 Topeka St </li>
<li class="address_2">Norman, OK 73069-8224 | <a href="http://www.google.com/maps/place/1716 Topeka St++Norman+OK+United States" target="_blank">Map</a> </li>
<li class="telephone">Phone: (405) 816-0490</li>
<li class="brewery_type">Type: Micro</li>
<li class="url"><a href="http://www.405brewing.com" target="_blank">www.405brewing.com</a> </li>
</ul>
<ul class="vcard simple col2"></ul>
</div>
</body>'
page <- read_html(html_file)
tibble(
name = page %>% html_nodes(".vcard .name") %>% html_text(),
address = page %>% html_nodes(".vcard .address") %>% html_text(),
type = page %>% html_nodes(".vcard .brewery_type") %>% html_text() %>% stringr::str_replace_all("^Type: ", ""),
website = page %>% html_nodes(".vcard .url a") %>% html_attr("href")
)
#> # A tibble: 2 x 4
#> name address type website
#> <chr> <chr> <chr> <chr>
#> 1 Bradley Farm / RB Brew, LLC 317 Springtown Rd Micro http://www.raybradleyfarm.com
#> 2 (405) Brewing Co 1716 Topeka St Micro http://www.405brewing.com
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.