Scrape using Perl regex in R

When you scrape links in R with either rvest or RSelenium, you can select them by the beginning of the HTML code, e.g. an a href within a given node. What if I face the following link:

<a href="www.website.com" data-tracking="click_body" data-tracking- 
data='{"touch_point_button":"photo"}' data-featured-name="listing_no_promo" >

If I wanted to grab the no-promo links, I would use the following piece of code (with the XML and httr packages):

library(XML)
library(httr)
response <- GET(yourLink)
parsedoc <- htmlParse(response)
# The attribute in the example link is data-featured-name
xpathSApply(parsedoc, "//a[@data-featured-name='listing_no_promo']", xmlGetAttr, "href")

What should I do if I want to get the links whose attribute ends with the 'photo' part:

data-tracking- data='{"touch_point_button":"photo"}'

regardless of the promo or no-promo part? My guess is that the curly brackets are causing trouble here.

I'm assuming your example link structure is actually as follows (where data-tracking-data is the actual attribute):

<a href="www.website.com" data-tracking="click_body" data-tracking-data='{"touch_point_button":"photo"}' data-featured-name="listing_no_promo">link</a>

Since I don't know what site you are working with, I recreated an HTML document by adding your link to the body of this page:

# I'm going to use the jsonlite and xml2 packages,
# plus magrittr for the %>% pipe used below (xml2 does not export it)

library(jsonlite)
library(xml2)
library(magrittr)

# This page
stack_url <- "https://stackoverflow.com/questions/40934644/xpath-for-element-whose-attribute-value-ends-with-a-specific-string"

# Your html element example
test_a <- '<a href="www.website.com" data-tracking="click_body" data-tracking-data=\'{"touch_point_button":"photo"}\' data-featured-name="listing_no_promo" >link</a>'

# Read in the Stack Overflow page
raw_page <- read_html(stack_url)
# Read in the example <a> element
raw_a <- read_html(test_a)

# Add the link element from the example to raw_page
xml_add_child(raw_page, raw_a)
# Show that the tag you provided is now mixed in with the page's other
# link elements, as I assume it would be in your actual use
xml_find_all(raw_page, ".//a") %>% tail()

{xml_nodeset (6)}
[1] <a href="https://www.facebook.com/officialstackoverflow/" class="-link">Facebook</a>
[2] <a href="https://twitter.com/stackoverflow" class="-link">Twitter</a>
[3] <a href="https://linkedin.com/company/stack-overflow" class="-link">LinkedIn</a>
[4] <a href="https://creativecommons.org/licenses/by-sa/3.0/" rel="license">cc by-sa 3.0</a>
[5] <a href="https://stackoverflow.blog/2009/06/25/attribution-required/" rel="license">attribution required</a>
[6] <a href="www.website.com" data-tracking="click_body" data-tracking-data='{"touch_point_button":"photo"}' data-f ...

Our xml_document is now stored in raw_page, and we can use an XPath expression to find what we want:

.//a[attribute::*[contains(.,'{') or contains(.,'photo')] and @data-tracking]

# Our xpath pattern reads as:
# 
# - .//a[ -> find all 'a' html elements where
# - attribute::*[contains(.,'{') or contains(.,'photo')] -> any(*) attribute containing either a '{' OR the string 'photo'
# - and @data-tracking -> and the element must have the attribute data-tracking, but it doesn't matter what the value is
# - ] -> end

In short:
Find all links that have a data-tracking attribute AND that have any attribute containing the word photo OR the character {.
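To see how the pattern behaves, here is a small sketch on a made-up document (the three links and the variable names are hypothetical, not taken from your site). Because the two contains() tests are joined with or, a link whose JSON never mentions photo still matches on the { character. Restricting the test to the data-tracking-data attribute is one possible way to tighten it, if that is closer to what you need.

```r
library(xml2)

# Hypothetical document: one "photo" link, one "title" link, one plain link
doc <- read_html(paste0(
  '<html><body>',
  '<a href="a.html" data-tracking="click_body" ',
  'data-tracking-data=\'{"touch_point_button":"photo"}\'>first</a>',
  '<a href="b.html" data-tracking="click_body" ',
  'data-tracking-data=\'{"touch_point_button":"title"}\'>second</a>',
  '<a href="c.html">third</a>',
  '</body></html>'
))

# The pattern from above: any attribute containing '{' OR 'photo'
loose_xpath <- ".//a[attribute::*[contains(.,'{') or contains(.,'photo')] and @data-tracking]"
loose <- xml_attr(xml_find_all(doc, loose_xpath), "href")
loose   # "a.html" "b.html" (the title link matches because of the '{')

# A tighter variant: only look for 'photo' inside data-tracking-data
strict_xpath <- ".//a[@data-tracking and contains(@data-tracking-data, 'photo')]"
strict <- xml_attr(xml_find_all(doc, strict_xpath), "href")
strict  # "a.html"
```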

our_xpath <- ".//a[attribute::*[contains(.,'{') or contains(.,'photo')] and @data-tracking]"
# Extract all of the matching elements using our xpath
# Get all the attribute values for data-tracking-data
# Parse from JSON
xml_find_all(raw_page,our_xpath) %>% xml_attr("data-tracking-data") %>% fromJSON()

Which results in:

$touch_point_button
[1] "photo"
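Since the question asks for the links themselves, the same idea can be turned around: parse each data-tracking-data value as JSON and keep the href of the elements whose touch_point_button is "photo". This is only a sketch under the assumption that the attribute always holds a JSON object; the two-link document and the helper name is_photo_button are made up for illustration.

```r
library(xml2)
library(jsonlite)

# Hypothetical document standing in for the real page
doc <- read_html(paste0(
  '<html><body>',
  '<a href="a.html" data-tracking-data=\'{"touch_point_button":"photo"}\'>first</a>',
  '<a href="b.html" data-tracking-data=\'{"touch_point_button":"title"}\'>second</a>',
  '</body></html>'
))

# Hypothetical helper: TRUE when the JSON text has touch_point_button == "photo"
is_photo_button <- function(txt) {
  parsed <- tryCatch(fromJSON(txt), error = function(e) NULL)
  is.list(parsed) && identical(parsed$touch_point_button, "photo")
}

# Every link that carries the attribute at all
links <- xml_find_all(doc, ".//a[@data-tracking-data]")

# Test each attribute value, then keep the href of the matching links
keep <- vapply(xml_attr(links, "data-tracking-data"), is_photo_button,
               logical(1), USE.NAMES = FALSE)
photo_hrefs <- xml_attr(links[keep], "href")
photo_hrefs
```

Parsing the JSON avoids depending on the exact spacing or key order inside the attribute, at the cost of one fromJSON() call per link.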

I have no way to test against your page, but if you post the URL I'd be happy to make sure it works.

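//*[ends-with(@data-tracking-data, '"photo"}')]/@href

In your example, this XPath gives you the href attribute whenever data-tracking-data ends with the string "photo"}. Note that ends-with() is an XPath 2.0 function, so it needs an XPath 2.0 processor.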
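The XML and xml2 packages are built on libxml2, which implements XPath 1.0, so ends-with() is probably not available from R. A common XPath 1.0 workaround is to compare the tail of the attribute, taken with substring() and string-length(), against the wanted suffix. The sketch below assumes the attribute is named data-tracking-data; the test document and variable names are hypothetical.

```r
library(xml2)

# Hypothetical document standing in for the real page
doc <- read_html(paste0(
  '<html><body>',
  '<a href="a.html" data-tracking-data=\'{"touch_point_button":"photo"}\'>first</a>',
  '<a href="b.html" data-tracking-data=\'{"touch_point_button":"title"}\'>second</a>',
  '</body></html>'
))

# Suffix the attribute value should end with
suffix <- '"photo"}'

# XPath 1.0 stand-in for ends-with(): take the last nchar(suffix) characters
# of the attribute and compare them with the suffix
xpath <- sprintf(
  ".//a[substring(@data-tracking-data, string-length(@data-tracking-data) - %d) = '%s']",
  nchar(suffix) - 1, suffix
)

ends_with_hrefs <- xml_attr(xml_find_all(doc, xpath), "href")
ends_with_hrefs
```

The suffix is wrapped in single quotes inside the expression, so this form only works for suffixes that contain no single quote themselves.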

