
Beautiful Soup not returning a list for html table

I am trying to extract the description, date and URL from the table on the following page:

https://www.fda.gov/safety/recalls-market-withdrawals-safety-alerts

For my code to stay consistent with the 20 other URLs I scrape, I need to keep the logic below, i.e. find_all on the whole table body and then loop through it to find the applicable data.

The problem is that the table body comes back empty.

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.fda.gov/safety/recalls-market-withdrawals-safety-alerts")

c = r.content

soup = BeautifulSoup(c,"html.parser")

all = soup.find_all("tbody") #whole table text THIS IS WHERE THE PROBLEM ORIGINATES

for item in all:
    print(item.find_all("tr").text) #test for tr text i.e. product description
    print(item.find("a")["href"]) #url
    print(item.find_all("td")[0].text) #date (won't work but can't test until tbody returns data

What am I doing wrong?

Thanks in advance!

The table on that page is loaded dynamically, using JavaScript, from another URL, so it is not part of the HTML that requests receives. Using the developer tools in your browser, you can copy that request and use it in your code. Then load the response into a pandas DataFrame, and you're done:

import requests
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'en-US,en;q=0.5',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive',
    'Referer': 'https://www.fda.gov/safety/recalls-market-withdrawals-safety-alerts',
    'TE': 'Trailers',
}

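# '_' appears to be a cache-busting timestamp copied from the browser request
params = (
    ('_', '1589124541273'),
)

response = requests.get('https://www.fda.gov/files/api/datatables/static/recalls-market-withdrawals.json', headers=headers, params=params)
response.raise_for_status()  # stop early if the request failed

# Decode the JSON body and build the DataFrame from it
df = pd.DataFrame(response.json())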

Using standard pandas methods, you can then extract the target information from the table.
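
As a minimal sketch of that last step, the snippet below picks a few columns out of a DataFrame and parses the date column. The column names (date, description, url) and the sample rows are hypothetical placeholders, not taken from the real response; check df.columns first and substitute the actual names.

```python
import pandas as pd


def select_fields(df, columns, date_column=None):
    """Return a copy of df limited to the requested columns.

    Column names are supplied by the caller; none are assumed to exist
    in the real FDA response.
    """
    missing = [name for name in columns if name not in df.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    result = df.loc[:, columns].copy()
    if date_column is not None:
        # errors="coerce" turns unparseable dates into NaT instead of raising
        result[date_column] = pd.to_datetime(result[date_column], errors="coerce")
    return result


# Made-up rows standing in for the DataFrame built from the JSON response
sample = pd.DataFrame(
    [
        {"date": "05/08/2020", "description": "Example product A", "url": "/a", "extra": 1},
        {"date": "05/07/2020", "description": "Example product B", "url": "/b", "extra": 2},
    ]
)

selected = select_fields(sample, ["date", "description", "url"], date_column="date")
print(selected)
```

Raising on a missing column is a deliberate choice here: a renamed field in the JSON then fails loudly instead of silently producing an empty column.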

Another option, in this particular case, is to try working with the FDA's API.
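
One public FDA API is openFDA, which serves recall enforcement reports as JSON. The sketch below only builds a query URL and reads fields from an already-decoded response. It is an assumption that these reports are a usable substitute for the rows on the page above, and the sample record is made up. The endpoint path and field names follow the openFDA enforcement documentation as I understand it, so verify them before relying on this.

```python
from urllib.parse import urlencode

# Assumed endpoint: openFDA food enforcement reports (drug/ and device/ also exist)
OPENFDA_ENFORCEMENT = "https://api.fda.gov/food/enforcement.json"


def build_query(limit=5, search=None):
    """Build an openFDA query URL; 'search' is an optional openFDA search expression."""
    params = {"limit": limit}
    if search:
        params["search"] = search
    return f"{OPENFDA_ENFORCEMENT}?{urlencode(params)}"


def summarize(payload):
    """Pull a few fields from a decoded response; missing keys become None."""
    return [
        {
            "date": record.get("report_date"),
            "description": record.get("product_description"),
            "firm": record.get("recalling_firm"),
        }
        for record in payload.get("results", [])
    ]


# Made-up payload shaped like an openFDA response, for illustration only
sample_payload = {
    "results": [
        {
            "report_date": "20200506",
            "product_description": "Example product",
            "recalling_firm": "Example Firm",
        }
    ]
}

print(build_query(limit=1))
print(summarize(sample_payload))
```

With requests, the live call would be requests.get(build_query(limit=5)).json() passed to summarize.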

You can inspect the web responses using Firefox - Developer Tools - Network. You will find a JSON URL that is cleaner and easier to parse.

https://www.fda.gov/files/api/datatables/static/recalls-market-withdrawals.json?_=1589125108944
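
To illustrate why the JSON is easier to handle than the rendered table, here is a rough sketch that downloads that URL using only the standard library and keeps selected keys from each record. It assumes the response is a JSON array of objects; the key names in the sample are hypothetical, so inspect one real record first. The query string is left off on the assumption that it is only a cache-busting timestamp.

```python
import json
from urllib.request import Request, urlopen

JSON_URL = "https://www.fda.gov/files/api/datatables/static/recalls-market-withdrawals.json"


def fetch_records(url=JSON_URL, timeout=30):
    """Download and decode the JSON document (performs a network request)."""
    # A browser-like User-Agent, in case the server rejects the default one
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


def pick(records, keys):
    """Keep only the given keys from each record; absent keys become None."""
    return [{key: record.get(key) for key in keys} for record in records]


# Made-up records with hypothetical keys, so the example runs offline
sample_records = [
    {"date": "05/08/2020", "description": "Example product", "url": "/example", "other": "x"},
]
print(pick(sample_records, ["date", "description", "url"]))

# Against the live endpoint (replace the key names with the real ones):
# rows = pick(fetch_records(), ["date", "description", "url"])
```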
