Web scraping from an HTML table using rvest

I'm new to web scraping and am trying to scrape the following table:

                    <table class="dp-firmantes table table-condensed table-striped">
                        <thead>
                            <tr>
                                <th>FIRMANTE</th>
                                <th>DISTRITO</th>
                                <th>BLOQUE</th>
                            </tr>
                        </thead>
                        <tbody>

                            <tr>
                                <td>ROMERO, JUAN CARLOS</td>
                                <td>SALTA</td>
                                <td>JUSTICIALISTA 8 DE OCTUBRE</td>
                            </tr>
                            <tr>
                                <td>FIORE VIÑUALES, MARIA CRISTINA DEL VALLE</td>
                                <td>SALTA</td>
                                <td>PARES</td>
                            </tr>
                            </tbody>
                    </table>

I'm using the rvest package and my code is the following:

library(rvest)

link <- read_html("https://www.hcdn.gob.ar/proyectos/resultados-buscador.html?")
table <- html_nodes(link, 'table.dp-firmantes table table-condensed table-striped')

But when I go to look at the 'table' object in R, I get the following error: {xml_nodeset (0)}

My instinct is that I'm actually not scraping any of the HTML content from the table, but I don't know why this is happening or how to fix it. I'm not sure whether the error is in my R code, whether I'm just using the wrong CSS selector, or whether this is JavaScript-generated content rather than plain HTML. Please let me know what I'm doing wrong here.

Edited: here is the link I'm using https://www.hcdn.gob.ar/proyectos/resultados-buscador.html

Edited: screenshot of the search results table

You could try the following code to parse the "Listado de Autores" tables for those bills that have them. For instance, the bill with expediente N. 820/18 (link = http://www.senado.gov.ar/parlamentario/comisiones/verExp/820.18/S/PL ) has that table, but I web-scraped the first 500 bills and did not find any other bill with such data.

library(tidyverse)
library(rvest)

html_object <- read_html('http://www.senado.gov.ar/parlamentario/comisiones/verExp/820.18/S/PL')

html_object %>%
  html_node(xpath = "//div[@id = 'Autores']/table") %>% # XPath address that worked for me; the CSS selector you provided did not.
  html_table() %>%
  as_tibble() %>% # Get the html table and store it in a tibble.
  mutate(X1 = gsub("\\n|\\t|  ", "", X1)) # Remove the extra line breaks (\n), tabs (\t), and double spaces present in the html table.

Results:

# A tibble: 2 x 2
  X1
  <chr>
1 Romero, Juan Carlos
2 Fiore Viñuales, María Cristina Del Valle
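
To repeat that check across many bills, a minimal sketch like the one below could work. It assumes the expediente URLs follow the same verExp/<number>.18/S/PL pattern as the example above; the get_autores helper is hypothetical, and bills whose pages have no 'Autores' div are simply skipped.

library(tidyverse)
library(rvest)

# Hypothetical helper: fetch one bill page and return its "Listado de Autores"
# table as a tibble, or NULL when the page has no such table.
get_autores <- function(n) {
  url <- paste0("http://www.senado.gov.ar/parlamentario/comisiones/verExp/",
                n, ".18/S/PL")
  page <- tryCatch(read_html(url), error = function(e) NULL)
  if (is.null(page)) return(NULL)

  node <- html_node(page, xpath = "//div[@id = 'Autores']/table")
  if (inherits(node, "xml_missing")) return(NULL)

  html_table(node) %>%
    as_tibble() %>%
    mutate(X1 = gsub("\\n|\\t|  ", "", X1))
}

# Example: check expedientes 815 to 825 and keep only bills that have the table.
autores <- map(815:825, get_autores)
names(autores) <- 815:825   # label each result with its expediente number
autores <- compact(autores) # drop the bills that had no "Autores" table

compact() removes the NULL entries, so autores ends up as a named list with one tibble per bill that actually lists its authors.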

Edited: Screenshot of R's HTML capture through read_html('https://www.hcdn.gob.ar/proyectos/resultados-buscador.html?pagina=2')
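
If the table is not present in the HTML that read_html() downloads (for example because the page builds it with JavaScript), the selector has nothing to match and you get an empty node set. A quick check along these lines shows how many <table> nodes the fetched document actually contains; it also uses the dot-chained form of the CSS selector, since classes on the same element are joined with dots rather than separated by spaces.

library(rvest)

page <- read_html("https://www.hcdn.gob.ar/proyectos/resultados-buscador.html?pagina=2")

# How many <table> elements does the downloaded HTML contain at all?
length(html_nodes(page, "table"))

# Classes on one element are chained with dots in a CSS selector;
# spaces would mean "descendant elements" instead.
html_nodes(page, "table.dp-firmantes.table-condensed.table-striped")

If both calls come back empty, the data is not in the static page source, and you would need something other than read_html() to get it, for example an endpoint that serves the data directly or a headless-browser approach such as RSelenium.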

