
Parsing webpage with robobrowser and beautifulsoup

I'm new to web scraping and am trying to parse a website after doing a form submission with robobrowser. I get the correct data back (I can view it when I do print(browser.parsed)), but I'm having trouble parsing it. The relevant part of the webpage's source code looks like this:

<div id="ii">
<tr>
  <td scope="row" id="t1a"> ID (ID Number)</a></td>
  <td headers="t1a">1234567 &nbsp;</td>
</tr>
<tr>
  <td scope="row" id="t1b">Participant Name</td>
  <td headers="t1b">JONES, JOHN                          &nbsp;</td>
</tr>
<tr>
  <td scope="row" id="t1c">Sex</td>
  <td headers="t1c">MALE   &nbsp;</td>
</tr>
<tr>
  <td scope="row" id="t1d">Date of Birth</td>
  <td headers="t1d">11/25/2016 &nbsp;</td>
</tr>
<tr>
  <td scope="row" id="t1e">Race / Ethnicity</a></td>
  <td headers="t1e">White                  &nbsp;</td>
</tr>

If I do:

in: browser.select('#t1b')

I get:

out: [<td id="t1b" scope="row">Participant Name</td>]

instead of JONES, JOHN.

The only way I've been able to get the relevant data is by doing:

browser.select('tr')

This returns a list of 29 elements, one for each 'tr', which I can convert to text and search for the relevant info.
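If you stay with that row-by-row approach, each 'tr' is itself a BeautifulSoup Tag, so you can pull the two cells out of every row directly instead of converting to text. A minimal sketch, assuming every data row follows the label/value layout shown above:

for row in browser.select('tr'):
    cells = row.find_all('td')
    if len(cells) == 2:  # skip rows that don't have a label cell and a value cell
        label = cells[0].get_text(strip=True)
        value = cells[1].get_text(strip=True)
        print(label, '->', value)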

I've also tried creating a BeautifulSoup object:

x = browser.select('#ii')
soup = BeautifulSoup(x[0].text, "html.parser")

but it loses all of the tags/ids, so I can't figure out how to search within it.
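Incidentally, the tags disappear because x[0].text returns only the text content. The element returned by browser.select is already a BeautifulSoup Tag, so you can search inside it directly without re-parsing. A minimal sketch using the #ii container from above:

x = browser.select('#ii')      # list of BeautifulSoup Tag objects
container = x[0]

# the id/headers attributes are still present on the Tag, so a targeted select works
print(container.select('td[headers=t1b]')[0].get_text(strip=True))  # JONES, JOHN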

Is there an easy way to loop through each 'tr' element and get the actual data rather than the label, as opposed to repeatedly converting to a string variable and searching through it?

Thanks

Get all the "label" td elements, get the value from the next td sibling, and collect the results into a dict:

from pprint import pprint
from bs4 import BeautifulSoup

data = """
<table>
    <tr>
      <td scope="row" id="t1a"> ID (ID Number)</a></td>
      <td headers="t1a">1234567 &nbsp;</td>
    </tr>
    <tr>
      <td scope="row" id="t1b">Participant Name</td>
      <td headers="t1b">JONES, JOHN                          &nbsp;</td>
    </tr>
    <tr>
      <td scope="row" id="t1c">Sex</td>
      <td headers="t1c">MALE   &nbsp;</td>
    </tr>
    <tr>
      <td scope="row" id="t1d">Date of Birth</td>
      <td headers="t1d">11/25/2016 &nbsp;</td>
    </tr>
    <tr>
      <td scope="row" id="t1e">Race / Ethnicity</a></td>
      <td headers="t1e">White                  &nbsp;</td>
    </tr>
</table>
"""

soup = BeautifulSoup(data, 'html5lib')

data = {
    label.get_text(strip=True): label.find_next_sibling("td").get_text(strip=True)
    for label in soup.select("tr > td[scope=row]")
}
pprint(data)

Prints:

{'Date of Birth': '11/25/2016',
 'ID (ID Number)': '1234567',
 'Participant Name': 'JONES, JOHN',
 'Race / Ethnicity': 'White',
 'Sex': 'MALE'}
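
Since RoboBrowser exposes the parsed page as BeautifulSoup (browser.select returns Tag objects), the same comprehension should work directly on the page you get back from the form submission. A sketch, assuming the rows shown in the question:

# browser is the RoboBrowser instance after the form submission
data = {
    label.get_text(strip=True): label.find_next_sibling("td").get_text(strip=True)
    for label in browser.select("td[scope=row]")
}
print(data["Participant Name"])  # JONES, JOHN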
