Web scraping from an HTML table using rvest

I'm new to web scraping and am trying to scrape the following table:

                    <table class="dp-firmantes table table-condensed table-striped">
                        <thead>
                            <tr>
                                <th>FIRMANTE</th>
                                <th>DISTRITO</th>
                                <th>BLOQUE</th>
                            </tr>
                        </thead>
                        <tbody>

                            <tr>
                                <td>ROMERO, JUAN CARLOS</td>
                                <td>SALTA</td>
                                <td>JUSTICIALISTA 8 DE OCTUBRE</td>
                            </tr>
                            <tr>
                                <td>FIORE VIÑUALES, MARIA CRISTINA DEL VALLE</td>
                                <td>SALTA</td>
                                <td>PARES</td>
                            </tr>
                            </tbody>
                    </table>

I'm using the rvest package and my code is the following:

library(rvest)

link <- read_html("https://www.hcdn.gob.ar/proyectos/resultados-buscador.html?")
table <- html_nodes(link, 'table.dp-firmantes table table-condensed table-striped')

But when I go to look at the 'table' object in R, I get the following error: {xml_nodeset (0)}

My instinct is that I'm actually not scraping any of the HTML content from the table, but I don't know why this is happening or how to fix it. I'm not sure whether the error is in my R code, whether I'm just using the wrong CSS selector, or whether this is JavaScript-generated content rather than plain HTML. Please let me know what I'm doing wrong here.

Edited: here is the link I'm using https://www.hcdn.gob.ar/proyectos/resultados-buscador.html

Edited: screenshot of the search results table

You could try the following code to parse the "Listado de Autores" tables for those bills that have them. For instance, the bill with expediente N. 820/18 (link = http://www.senado.gov.ar/parlamentario/comisiones/verExp/820.18/S/PL ) has that table, but I web-scraped the first 500 bills and did not find any other bill with such data.

library(tidyverse)
library(rvest)

html_object <- read_html('http://www.senado.gov.ar/parlamentario/comisiones/verExp/820.18/S/PL')

html_object %>%
  html_node(xpath = "//div[@id = 'Autores']/table") %>% # XPath address that worked for me; the CSS selector you provided did not.
  html_table() %>%
  as_tibble() %>% # Get the html table and store it in a tibble.
  mutate(X1 = gsub("\\n|\\t|  ", "", X1)) # Remove the extra line breaks (\n), tabs (\t), and double spaces present in the html table.

Results:

# A tibble: 2 x 2
  X1
  <chr>
1 Romero, Juan Carlos
2 Fiore Viñuales, María Cristina Del Valle
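
To repeat that check across many bills, a minimal sketch like the one below could work. It assumes the expediente URLs follow the same verExp/<number>.18/S/PL pattern as the example above; the get_autores helper is hypothetical, and bills whose pages have no 'Autores' div are simply skipped.

library(tidyverse)
library(rvest)

# Hypothetical helper: fetch one bill page and return its "Listado de Autores"
# table as a tibble, or NULL when the page has no such table.
get_autores <- function(n) {
  url <- paste0("http://www.senado.gov.ar/parlamentario/comisiones/verExp/",
                n, ".18/S/PL")
  page <- tryCatch(read_html(url), error = function(e) NULL)
  if (is.null(page)) return(NULL)

  node <- html_node(page, xpath = "//div[@id = 'Autores']/table")
  if (inherits(node, "xml_missing")) return(NULL)

  html_table(node) %>%
    as_tibble() %>%
    mutate(X1 = gsub("\\n|\\t|  ", "", X1))
}

# Example: check expedientes 815 to 825 and keep only bills that have the table.
autores <- map(815:825, get_autores)
names(autores) <- 815:825   # label each result with its expediente number
autores <- compact(autores) # drop the bills that had no "Autores" table

compact() removes the NULL entries, so autores ends up as a named list with one tibble per bill that actually lists its authors.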

Edited: Screenshot of R's HTML capture through read_html('https://www.hcdn.gob.ar/proyectos/resultados-buscador.html?pagina=2')
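
If the table is not present in the HTML that read_html() downloads (for example because the page builds it with JavaScript), the selector has nothing to match and you get an empty node set. A quick check along these lines shows how many <table> nodes the fetched document actually contains; it also uses the dot-chained form of the CSS selector, since classes on the same element are joined with dots rather than separated by spaces.

library(rvest)

page <- read_html("https://www.hcdn.gob.ar/proyectos/resultados-buscador.html?pagina=2")

# How many <table> elements does the downloaded HTML contain at all?
length(html_nodes(page, "table"))

# Classes on one element are chained with dots in a CSS selector;
# spaces would mean "descendant elements" instead.
html_nodes(page, "table.dp-firmantes.table-condensed.table-striped")

If both calls come back empty, the data is not in the static page source, and you would need something other than read_html() to get it, for example an endpoint that serves the data directly or a headless-browser approach such as RSelenium.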

