
Simultaneously escape double and single quotes in XPath

Similar to "How to deal with single quote in xpath", I want to escape single quotes. The difference is that I can't exclude the possibility that a double quote might also appear in the target string.

Goal:

Escape double and single quotes simultaneously in XPath (in R). The target string should be passed as a variable rather than hard-coded as in one of the existing answers, because I don't know its content beforehand: it could contain single quotes, double quotes, or both.

Works:

library(rvest)
library(magrittr)
html <- "<div>1</div><div>Father's son</div>"
target <- "Father's son"
html %>% xml2::read_html() %>% html_nodes(xpath = paste0("//*[contains(text(), \"", target,"\")]"))
#> {xml_nodeset (1)}
#> [1] <div>Father's son</div>

Does not work:

html <- "<div>1</div><div>Fat\"her's son</div>"
target <- "Fat\"her's son"
html %>% xml2::read_html() %>% html_nodes(xpath = paste0("//*[contains(text(), \"", target,"\")]"))
#> {xml_nodeset (0)}
#> Warning message:
#> In xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = Inf) :
#>   Invalid expression [1207]
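
Printing the string that paste0() builds shows why the expression is rejected: the double quote inside the target closes the XPath string literal early. A minimal base-R check, with no XML parsing involved (`xpath` and `n_double` are just illustrative names):

```r
target <- "Fat\"her's son"
xpath  <- paste0("//*[contains(text(), \"", target, "\")]")

# cat() prints the raw characters, without R's own escaping
cat(xpath, "\n")
#> //*[contains(text(), "Fat"her's son")]

# Count the double quotes: the literal opened by the first one is closed
# by the quote inside `target`, and the third one is left unbalanced
n_double <- lengths(regmatches(xpath, gregexpr("\"", xpath, fixed = TRUE)))
n_double
#> [1] 3
```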

Update

Non-R answers that I could try to "translate to R" are very welcome.

Because you are using string manipulation to build your XPath expression, it is your responsibility to make sure the expression is valid XPath. This expression:

//*[contains(.,concat('Fat"',"her's son"))]

Selects:

<div>Fat"her's son</div>


A better approach would be to use an XPath string variable, but R doesn't appear to have an API for that, even through libxml.
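
A concat() expression like the one above can be generated for an arbitrary string. What follows is a minimal sketch in base R, not an xml2 or rvest API: `xpath_literal` is a hypothetical helper name, and it assumes a single non-NA string. It uses a double-quoted literal when the string has no double quote, a single-quoted literal when it has no single quote, and otherwise splits the string into runs of double quotes and runs of everything else, quoting each run with the opposite delimiter.

```r
# Hypothetical helper: turn an R string into an XPath 1.0 string expression
xpath_literal <- function(x) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))

  # No double quote: a double-quoted literal is enough
  if (!grepl('"', x, fixed = TRUE)) return(paste0('"', x, '"'))
  # No single quote: a single-quoted literal is enough
  if (!grepl("'", x, fixed = TRUE)) return(paste0("'", x, "'"))

  # Both present: alternate runs of double quotes and runs of other characters
  pieces <- regmatches(x, gregexpr('[^"]+|"+', x))[[1]]
  is_dq  <- startsWith(pieces, '"')
  quoted <- ifelse(is_dq,
                   paste0("'", pieces, "'"),   # runs of " go inside '...'
                   paste0('"', pieces, '"'))   # everything else inside "..."
  paste0("concat(", paste(quoted, collapse = ", "), ")")
}

target <- "Fat\"her's son"
cat(xpath_literal(target), "\n")
#> concat("Fat", '"', "her's son")

# The literal can then be pasted into a query string
xpath <- paste0("//*[contains(., ", xpath_literal(target), ")]")
```

The resulting `xpath` string can be passed to html_nodes(xpath = xpath) in the same way as the hand-written expression. Because both quote types are present in the concat() branch, there are always at least two pieces, which is what concat() requires.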

The key here is realising that with xml2 you can write back into the parsed HTML with HTML-escaped characters. This function will do the trick. It's longer than it needs to be because I've included comments and some type checking / converting logic.

contains_text <- function(node_set, find_this)
{
  # Ensure we have a nodeset (inherits() copes with multi-element class vectors)
  if(inherits(node_set, "xml_document"))
    node_set %<>% xml_children()

  if(!inherits(node_set, "xml_nodeset"))
    stop("contains_text requires an xml_nodeset or xml_document.")

  # Get all leaf nodes
  node_set %<>% html_nodes(xpath = "//*[not(*)]")

  # HTML escape the target string
  find_this %<>% {gsub("\"", "&quot;", .)}

  # Extract, HTML escape and replace the nodes
  lapply(node_set, function(node) xml_text(node) %<>% {gsub("\"", "&quot;", .)})

  # Now we can define the xpath and extract our target nodes
  xpath <- paste0("//*[contains(text(), \"", find_this, "\")]")
  new_nodes <- html_nodes(node_set, xpath = xpath)

  # Since the underlying xml_document is passed by pointer internally,
  # we should unescape any text to leave it unaltered
  xml_text(node_set) %<>% {gsub("&quot;", "\"", .)}
  return(new_nodes)
}

Now:

library(rvest)
library(xml2)
library(magrittr)  # provides %<>%, which contains_text() uses

html %>% xml2::read_html() %>% contains_text(target)
#> {xml_nodeset (1)}
#> [1] <div>Fat"her's son</div>
html %>% xml2::read_html() %>% contains_text(target) %>% xml_text()
#> [1] "Fat\"her's son"

ADDENDUM

This is an alternative method, which implements the approach suggested by @Alejandro but allows arbitrary targets. It has the merit of leaving the XML document untouched, and is a little faster than the method above, but it involves the kind of string parsing that an XML library is supposed to prevent. It works by splitting the target after each " and ', enclosing each fragment in the opposite type of quote to the one it contains, then pasting the fragments back together with commas inside an XPath concat() call.

library(stringr)

safe_xpath <- function(target)
{
  target                                 %<>%
  str_replace_all("\"", "&quot;&break;") %>%
  str_replace_all("'", "&apo;&break;")   %>%
  str_split("&break;")                   %>%
  unlist()

  safe_pieces    <- grep("(&quot;)|(&apo;)", target, invert = TRUE)
  contain_quotes <- grep("&quot;", target)
  contain_apo    <- grep("&apo;", target)

  if(length(safe_pieces) > 0) 
      target[safe_pieces] <- paste0("\"", target[safe_pieces], "\"")

  if(length(contain_quotes) > 0)
  {
    target[contain_quotes] <- paste0("'", target[contain_quotes], "'")
    target[contain_quotes] <- gsub("&quot;", "\"", target[contain_quotes])
  }

  if(length(contain_apo) > 0)
  {
    target[contain_apo] <- paste0("\"", target[contain_apo], "\"")
    target[contain_apo] <- gsub("&apo;", "'", target[contain_apo])
  }

  fragment <- paste0(target, collapse = ",")
  return(paste0("//*[contains(text(),concat(", fragment, "))]"))
}

Now we can generate a valid XPath like this:

safe_xpath(target)
#> [1] "//*[contains(text(),concat('Fat\"',\"her'\",\"s son\"))]"

so that

html %>% xml2::read_html() %>% html_nodes(xpath = safe_xpath(target))
#> {xml_nodeset (1)}
#> [1] <div>Fat"her's son</div>
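
One edge case may be worth guarding: XPath 1.0's concat() takes at least two arguments, so a target with no quotes at all produces a single fragment, and a one-argument concat() call would be rejected. A sketch of a guard that could stand in for the last two lines of safe_xpath, where `wrap_fragments` is a hypothetical helper name and `fragments` stands for the already-quoted pieces:

```r
# Hypothetical helper: combine already-quoted XPath fragments into one expression
wrap_fragments <- function(fragments) {
  stopifnot(is.character(fragments), length(fragments) >= 1L)
  # A single quoted fragment is already a valid XPath string literal
  if (length(fragments) == 1L) return(fragments)
  paste0("concat(", paste(fragments, collapse = ","), ")")
}

# One fragment: returned as a plain literal, no concat()
cat(wrap_fragments("\"plain text\""), "\n")
#> "plain text"

# Several fragments: joined exactly as in safe_xpath
cat(wrap_fragments(c("'Fat\"'", "\"her'\"", "\"s son\"")), "\n")
#> concat('Fat"',"her'","s son")
```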

Use quote() for the XPath query

library(XML)

Only a single quote inside the string

target1 <- "Father's son"
doc1 <- XML::newHTMLDoc()
newXMLNode("div", 1, parent = getNodeSet(doc1, "//body"), doc = doc1)
newXMLNode("div", target1, parent = getNodeSet(doc1, "//body"), doc = doc1)
xpath_query1 <- paste0('//*[ contains(text(), ', '"', target1, '"', ')]')
getNodeSet(doc1, xpath_query1)

Both single and double quotes inside the string

target2 <- "Fat\"her's son"
doc2 <- XML::newHTMLDoc()
newXMLNode("div", 1, parent = getNodeSet(doc2, "//body"), doc = doc2)
newXMLNode("div", target2, parent = getNodeSet(doc2, "//body"), doc = doc2)
xpath_query2 <- quote('//body/*[contains(.,concat(\'Fat"\',"her\'s son"))]')
getNodeSet(doc2, xpath_query2)

Output:

getNodeSet(doc1, xpath_query1)
# [[1]]
# <div>Father's son</div> 
# 
# attr(,"class")
# [1] "XMLNodeSet"

getNodeSet(doc2, xpath_query2)
# [[1]]
# <div>Fat"her's son</div> 
# 
# attr(,"class")
# [1] "XMLNodeSet"
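
As far as I can tell, the escaping in the second example comes from the hand-written concat() expression rather than from quote(): for a string constant, base R's quote() returns the string unchanged. A small base-R check (`plain` and `quoted` are illustrative names):

```r
plain  <- '//body/*[contains(.,concat(\'Fat"\',"her\'s son"))]'
quoted <- quote('//body/*[contains(.,concat(\'Fat"\',"her\'s son"))]')

# quote() of a string constant is just that string
identical(quoted, plain)
#> [1] TRUE
is.character(quoted)
#> [1] TRUE
```

So the query still has to be written (or generated) with the quotes split across concat() arguments when the target is only known at run time.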

I added the cat function to the target inside the html_nodes() function call. It seems to handle both cases. cat() also has the side effect of printing the escaped text.

library(rvest)
library(magrittr)

html <- "<div>1</div><div>Father's son</div>"
target <- "Father's son"
html %>% xml2::read_html() %>% html_nodes(xpath = paste0("//*[contains(text(), \"",cat(target),"\")]"))
#> Father's son
#> {xml_nodeset (4)}
#> [1] <html><body>\n<div>1</div>\n<div>Father's son</div>\n</body></html>
#> [2] <body>\n<div>1</div>\n<div>Father's son</div>\n</body>
#> [3] <div>1</div>\n
#> [4] <div>Father's son</div>

html <- "<div>1</div><div>Father said \"Hello!\"</div>"
target <- 'Father said "Hello!"'
html %>% xml2::read_html() %>% html_nodes(xpath = paste0("//*[contains(text(), \"",cat(target),"\")]"))
#> Father said "Hello!"
#> {xml_nodeset (4)}
#> [1] <html><body>\n<div>1</div>\n<div>Father said "Hello!"</div>\n</body> ...
#> [2] <body>\n<div>1</div>\n<div>Father said "Hello!"</div>\n</body>
#> [3] <div>1</div>\n
#> [4] <div>Father said "Hello!"</div>
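
The four-node results suggest the target is not reaching the XPath at all. cat() prints its argument and returns NULL, and paste0() drops zero-length arguments, so the expression sent to html_nodes() ends up as contains(text(), ""), which holds for every element. A base-R check (`printed`, `value` and `xpath` are illustrative names):

```r
target <- 'Father said "Hello!"'

# Capture what cat() prints and keep what it returns
printed <- capture.output(value <- cat(target))
is.null(value)
#> [1] TRUE

# NULL contributes nothing to the pasted string
xpath <- paste0("//*[contains(text(), \"", value, "\")]")
cat(xpath, "\n")
#> //*[contains(text(), "")]
```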
