Unwanted backslashes when reading HTML with rvest

Question

I am trying to read a website using rvest. My code is as follows:

library(rvest)
pg <- read_html("https://www.gob.mx/presidencia/archivo/prensa?utf8=%E2%9C%93&idiom=es&style=list&order=DESC&filter_id=&filter_origin=archive&tags=&year=&category=Discursos+del+Presidente&year=&category=Discursos+del+Presidente")

However, when I print pg I get double backslashes around the HTML class values, as in the following snippet:

<a class='\\"small-link\\"' href="%5C%22/presidencia/es/prensa/epn-palabras-134612?idiom=es%5C%22" target='\\"_blank\\"'>

This does not occur when I read other websites:

pg2 <- read_html("http://www.imdb.com/title/tt0245712/")
#output: <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n

Any idea why this might happen? I really want to get rid of it, since it prevents me from retrieving data with html_nodes():

pg  %>%
  html_nodes(".small-link")
#output: {xml_nodeset (0)}

Update!

This error seems to happen only when using rvest from Mexican IP addresses :/
Following a suggestion below, I tried using a regex to clean my object (pg).

When I look at pg, the div classes have these double backslashes:

pg 
#Result: <div class='\\"col-md-12' small-bottom-buffer>

If I clean pg by trying to delete one backslash, it seems to work and only one is left:

pg2 <- gsub("\\\\", "", pg)
pg2
#Result: <div class='\"col-md-12' small-bottom-buffer>

However, if I try to delete both backslashes, I get three back instead:

pg3 <- gsub("\\\\\\\\", "", pg)
pg3
#<div class='\\\"col-md-12' small-bottom-buffer>

I don't understand this behaviour.

Answer 1

I'm not familiar enough with rvest to provide an rvest solution, but you can use readLines and grep to find the data you're looking for. You can then use a regular expression to clean it up:

pg3 <- readLines("https://www.gob.mx/presidencia/archivo/prensa?utf8=%E2%9C%93&idiom=es&style=list&order=DESC&filter_id=&filter_origin=archive&tags=&year=&category=Discursos+del+Presidente&year=&category=Discursos+del+Presidente")

grep('<a class=\"small-link\"', pg3, value = TRUE)
grep('<a class="small-link"', pg3, value = TRUE)
grep('<a class=\\"small-link\\"', pg3, value = TRUE)

All three work. The reason you are seeing \\" is that the backslash is R's escape character and " is a special character, since it delimits character data in R. When R prints a string, it escapes both, so a stored backslash is displayed as \\ and a stored double quote as \". For example:

> print("test"test")
Error: unexpected symbol in "print("test"test"
> print('test"test')
[1] "test\"test"
> print("test\"test")
[1] "test\"test"
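The same point can be checked on the gsub calls from the question. What follows is a minimal sketch in base R, not a tested fix for the actual page: the string x is a hypothetical stand-in for one scraped line, assumed to contain a single literal backslash before each quote (which is what a printed \\" would mean). It compares the escaped view from print() with the raw view from writeLines(), then shows which patterns actually match.

```r
# Hypothetical stand-in for one scraped line: the class value holds a
# literal backslash followed by a literal double quote.
x <- "<a class='\\\"small-link\\\"'>"

print(x)       # escaped view, as shown at the console
writeLines(x)  # raw view: the characters actually stored
nchar(x)       # counts stored characters, not the escapes print() adds

# The regex \\ (typed "\\\\" in R) matches ONE literal backslash.
one_removed <- gsub("\\\\", "", x)

# The regex \\\\ (typed "\\\\\\\\") needs TWO consecutive backslashes,
# which x does not contain, so nothing is replaced.
two_removed <- gsub("\\\\\\\\", "", x)

# fixed = TRUE skips regex parsing; "\\" is then one literal backslash.
cleaned <- gsub("\\", "", x, fixed = TRUE)
writeLines(cleaned)
```

If the assumption about x holds, the first gsub in the question already removed the backslash: the \" still visible afterwards is only print() escaping the quote. The second gsub matched nothing, and the three backslashes are the unchanged text shown in escaped form. The cleaned text could then be passed to read_html(), which also accepts a string of HTML, though that has not been checked against the live page here.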

Answer 1
0 2017-11-17 21:12:15
