
Beautiful Soup not returning a list for html table

I am trying to extract the description, date and url from the table in the following page:

https://www.fda.gov/safety/recalls-market-withdrawals-safety-alerts

To keep my code consistent with 20 other URLs, I need to keep the logic below, i.e. find_all on the whole table body and then loop through it to pick out the relevant data.

The problem is that the table body comes back empty.

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.fda.gov/safety/recalls-market-withdrawals-safety-alerts")

c = r.content

soup = BeautifulSoup(c,"html.parser")

tbodies = soup.find_all("tbody")  # whole table body: THIS IS WHERE THE PROBLEM ORIGINATES (returns [])

for tbody in tbodies:
    for row in tbody.find_all("tr"):
        print(row.text)  # test for tr text, i.e. product description
        print(row.find("a")["href"])  # url
        print(row.find_all("td")[0].text)  # date (untested until tbody returns data)

What am I doing wrong?

Thanks in advance!

The table on that page is loaded dynamically with JavaScript from a separate request, so it is not in the HTML that requests receives and find_all("tbody") returns an empty list. Using the Developer Tools in your browser, you can copy that request and use it in your code. Then load the response into a pandas DataFrame, and you're done:

import requests
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'en-US,en;q=0.5',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive',
    'Referer': 'https://www.fda.gov/safety/recalls-market-withdrawals-safety-alerts',
    'TE': 'Trailers',
}

params = (
    ('_', '1589124541273'),
)

response = requests.get('https://www.fda.gov/files/api/datatables/static/recalls-market-withdrawals.json', headers=headers, params=params)

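response.raise_for_status()  # fail early on a non-2xx status
# newer pandas versions warn on literal JSON strings; wrap in io.StringIO(...) if so
df = pd.read_json(response.text)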

Using standard pandas methods, you can then extract the target information from the table.
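
As a rough illustration of that extraction step, here is a minimal sketch. It does not hit the network: the records and the column names (date, description, path) are hypothetical stand-ins, so print df.columns on the real DataFrame first and substitute whatever names the feed actually uses.

```python
import pandas as pd

# Hypothetical records standing in for the downloaded JSON;
# the real column names and values may differ.
records = [
    {"date": "05/08/2020", "description": "Example product A",
     "path": "/safety/example-a", "company": "X"},
    {"date": "05/07/2020", "description": "Example product B",
     "path": "/safety/example-b", "company": "Y"},
]
df = pd.DataFrame(records)

print(df.columns.tolist())  # inspect the available column names first

# Keep only the columns of interest (copy to avoid modifying a view of df).
wanted = df[["date", "description", "path"]].copy()

# Build absolute URLs from the (assumed) relative paths.
wanted["url"] = "https://www.fda.gov" + wanted["path"]

# Parse the (assumed) MM/DD/YYYY strings into real timestamps.
wanted["date"] = pd.to_datetime(wanted["date"], format="%m/%d/%Y")

for row in wanted.itertuples(index=False):
    print(row.date.date(), row.description, row.url)
```

Selecting columns this way keeps the per-site logic down to a list of column names, which may fit the "same loop for every URL" requirement in the question better than site-specific parsing.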

Another option, in this particular case, is to try to work with the FDA's API.

You can inspect the network traffic using the Network tab of Firefox's Developer Tools. There you will find the JSON URL, which is cleaner and easier to parse:

https://www.fda.gov/files/api/datatables/static/recalls-market-withdrawals.json?_=1589125108944
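
The number after _= looks like a cache-busting timestamp in milliseconds, the kind jQuery appends to AJAX requests. That is an assumption based on its format, and the server may ignore it entirely. If you would rather not hard-code a stale value, a minimal sketch for generating a fresh one follows; build_url is a hypothetical helper name, not part of any library.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://www.fda.gov/files/api/datatables/static/recalls-market-withdrawals.json"


def build_url(base_url, millis=None):
    """Return base_url with a `_` query parameter holding a millisecond timestamp.

    `millis` defaults to the current time; pass a fixed value to reproduce
    a specific request.
    """
    if millis is None:
        millis = time.time_ns() // 1_000_000  # nanoseconds -> milliseconds
    return f"{base_url}?{urlencode({'_': millis})}"


print(build_url(BASE_URL))                 # fresh timestamp on every call
print(build_url(BASE_URL, 1589125108944))  # reproduces the URL shown above
```

With requests you can get the same effect by passing params={"_": millis} to requests.get instead of building the query string yourself.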
