rvest scraping html content values
I'm trying to scrape the following page: link in order to create a data frame with 7 columns: position, company and meta (1-5). Unfortunately, I don't know how to capture the values of the content attribute, for example <meta itemprop="jobLocation" content="Tauragė" />, so that the value Tauragė would be used in creating my data frame (in this example).
My initial code:
if(!require("tidyverse")) install.packages("tidyverse"); library("tidyverse")
if(!require("rvest")) install.packages("rvest"); library("rvest")
# setting url and reading html code
url <- "https://www.cv.lt/employee/announcementsAll.do?regular=true&salaryInterval=-1&interval=2&ipp=1000"
html <- read_html(url, encoding = "utf-8")
# creating a dataframe of ads
ads <- html %>%{
data.frame(
position=html_nodes(html, "tbody p a:nth-child(1)") %>% html_text(),
company=html_nodes(html, "tbody p a:nth-child(2)")%>% html_text(),
meta1=...
meta2=...
meta3=...
meta4=...
meta5=...
)}
An example of the HTML code:
<td>
<p itemscope itemtype="http://schema.org/JobPosting">
<a href="/valstybes-tarnyba/vsi-taurages-rajono-pirmines-sveikatos-prieziuros-centro-direktorius-taurageje-2-338912727/?sri=83" target="_blank" itemprop="title" onclick="$(this).parents('tr.data').addClass('read');">VšĮ Tauragės rajono pirminės sveikatos priežiūros centro direktorius</a>
<a href="/viesoji-istaiga-taurages-rajono-pirmines-sveikatos-prieziuros-centras-darbo-skelbimai" target="_blank" itemprop="hiringOrganization" itemscope itemtype="http://schema.org/Organization"><span itemprop="name">Viešoji įstaiga Tauragės rajono pirminės sveikatos priežiūros centras</span></a>
<meta itemprop="jobLocation" content="Tauragė" />
<meta itemprop="datePosted" content="2019-08-22" />
<meta itemprop="employmentType" content="FULL_TIME" />
<meta itemprop="validThrough" content="2019-09-06T00:00:00.000" />
<meta itemprop="url" content="https://www.cv.lt/valstybes-tarnyba/vsi-taurages-rajono-pirmines-sveikatos-prieziuros-centro-direktorius-taurageje-2-338912727" />
</p>
</td>
<td>
You can run this:
my_content <- html %>% html_nodes("tbody p meta") %>% html_attr("content")
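To see what that call returns, here is a minimal sketch that runs it on an inline copy of the meta tags from the question rather than on the live page. The names `snippet`, `doc` and `values` are only for illustration, and the `<table><tbody>` wrapper is an assumption added so that the `tbody p meta` selector has something to match.

```r
library(rvest)

# Inline copy of the <meta> tags from the question, wrapped in a table
# (the wrapper is assumed; the live page supplies its own table markup).
snippet <- '<table><tbody><tr><td>
<p itemscope itemtype="http://schema.org/JobPosting">
<meta itemprop="jobLocation" content="Tauragė" />
<meta itemprop="datePosted" content="2019-08-22" />
<meta itemprop="employmentType" content="FULL_TIME" />
<meta itemprop="validThrough" content="2019-09-06T00:00:00.000" />
<meta itemprop="url" content="https://www.cv.lt/valstybes-tarnyba/vsi-taurages-rajono-pirmines-sveikatos-prieziuros-centro-direktorius-taurageje-2-338912727" />
</p>
</td></tr></tbody></table>'

doc <- read_html(snippet)
meta_nodes <- html_nodes(doc, "tbody p meta")

# html_attr() reads one attribute per node; naming the values by itemprop
# makes it easier to see which value is which.
values <- html_attr(meta_nodes, "content")
names(values) <- html_attr(meta_nodes, "itemprop")
print(values)
```

The result is one flat character vector for the whole page, so with many ads the values of all ads follow each other in page order.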
After that, by indexing each of them, you can split them into meta1, meta2, ..., meta5 like this:
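# Each ad carries five <meta> tags, so label the values 1..5 in page order.
# This assumes no ad is missing a tag; the ad count is taken from the data
# instead of being hard-coded.
index <- rep(1:5, length.out = length(my_content))
meta <- data.frame(Meta = my_content, Index = index, stringsAsFactors = FALSE)
# Keep only the values, so each result can be used directly as a column
meta1 <- meta$Meta[meta$Index == 1]
meta2 <- meta$Meta[meta$Index == 2]
meta3 <- meta$Meta[meta$Index == 3]
meta4 <- meta$Meta[meta$Index == 4]
meta5 <- meta$Meta[meta$Index == 5]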
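If every ad really does have exactly five meta tags in the same order, the same split could also be sketched as a reshape of the flat vector into a five-column table. The `my_content` vector below is a made-up stand-in with two ads (the values and example.com URLs are invented for illustration), and `meta_wide` is a hypothetical name.

```r
# Made-up stand-in for my_content: two ads, five content values each,
# in the same order as the <meta> tags in the question.
my_content <- c(
  "Taurage", "2019-08-22", "FULL_TIME", "2019-09-06T00:00:00.000", "https://example.com/ad-1",
  "Vilnius", "2019-08-23", "PART_TIME", "2019-09-07T00:00:00.000", "https://example.com/ad-2"
)

# The reshape is only valid when the length is a multiple of five.
stopifnot(length(my_content) %% 5 == 0)

# byrow = TRUE puts each ad on its own row and each meta tag in its own column.
meta_wide <- as.data.frame(
  matrix(my_content, ncol = 5, byrow = TRUE),
  stringsAsFactors = FALSE
)
names(meta_wide) <- paste0("meta", 1:5)
print(meta_wide)
```

This silently misaligns the columns if any ad lacks one of the tags, so it is worth checking the length before relying on it.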
EDIT:
Another approach is to use the itemprop values inside html_nodes():
html %>% html_nodes("[itemprop='jobLocation']") %>% html_attr("content")
This gives only meta1. If you use the itemprop value of each meta tag, you can extract their contents like this:
meta1 <- html %>% html_nodes("[itemprop='jobLocation']") %>% html_attr("content")
meta2 <- html %>% html_nodes("[itemprop='datePosted']") %>% html_attr("content")
meta3 <- html %>% html_nodes("[itemprop='employmentType']") %>% html_attr("content")
meta4 <- html %>% html_nodes("[itemprop='validThrough']") %>% html_attr("content")
meta5 <- html %>% html_nodes("[itemprop='url']") %>% html_attr("content")
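One caveat with page-wide selectors is that the vectors can end up with different lengths if some ad lacks a tag. A possible way around that, sketched here on made-up inline HTML rather than the live page, is to select each ad's `<p>` first and then call `html_node()` (singular) inside it, which returns one result per ad and `NA` where the tag is missing. The sample ads, the `page` and `result` names and the `get_content()` helper are all hypothetical.

```r
library(rvest)

# Made-up page with two ads; the second one has no employmentType tag.
page <- read_html('
<table><tbody>
<tr><td><p itemscope itemtype="http://schema.org/JobPosting">
<a href="/ad-1" itemprop="title">Director</a>
<a href="/org-1" itemprop="hiringOrganization"><span itemprop="name">Clinic A</span></a>
<meta itemprop="jobLocation" content="Taurage" />
<meta itemprop="datePosted" content="2019-08-22" />
<meta itemprop="employmentType" content="FULL_TIME" />
<meta itemprop="validThrough" content="2019-09-06T00:00:00.000" />
<meta itemprop="url" content="https://example.com/ad-1" />
</p></td></tr>
<tr><td><p itemscope itemtype="http://schema.org/JobPosting">
<a href="/ad-2" itemprop="title">Nurse</a>
<a href="/org-2" itemprop="hiringOrganization"><span itemprop="name">Clinic B</span></a>
<meta itemprop="jobLocation" content="Vilnius" />
<meta itemprop="datePosted" content="2019-08-23" />
<meta itemprop="validThrough" content="2019-09-07T00:00:00.000" />
<meta itemprop="url" content="https://example.com/ad-2" />
</p></td></tr>
</tbody></table>')

# One node per ad
ads <- html_nodes(page, "tbody p")

# Hypothetical helper: content of the first matching <meta> inside each ad,
# NA when the ad has no such tag.
get_content <- function(nodes, prop) {
  nodes %>%
    html_node(paste0("meta[itemprop='", prop, "']")) %>%
    html_attr("content")
}

result <- data.frame(
  position = ads %>% html_node("a[itemprop='title']") %>% html_text(),
  company  = ads %>% html_node("a[itemprop='hiringOrganization']") %>% html_text(),
  meta1    = get_content(ads, "jobLocation"),
  meta2    = get_content(ads, "datePosted"),
  meta3    = get_content(ads, "employmentType"),
  meta4    = get_content(ads, "validThrough"),
  meta5    = get_content(ads, "url"),
  stringsAsFactors = FALSE
)
print(result)
```

Because every column is derived from the same `ads` node set, each row always describes a single ad, at the cost of one extra selector step.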