Can't find a table using Beautiful Soup

I'm new to using Beautiful Soup for web scraping. I'm trying to extract a table from https://clinicaltrials.gov/ct2/search/browse?brwse=cond_alpha_all but it's not working and I can't work out why. Here's what I did:

import requests
from bs4 import BeautifulSoup

url = "https://clinicaltrials.gov/ct2/search/browse?brwse=cond_alpha_all"

r = requests.get(url)  # fetch the HTML
soup = BeautifulSoup(r.content)  # parse that text as HTML
table = soup.find("table",{"id":"theDataTable","class":"display dataTable no-footer"}) 

It can't find the table. Why is that?

The table data is inside a <script> tag, so you need to pull it out of the script and parse it.

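import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://clinicaltrials.gov/ct2/search/browse?brwse=cond_alpha_all'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# The table rows are stored in a JavaScript variable inside a <script> tag.
jsonData = None
for script in soup.find_all('script'):
    if 'var tableData1' not in str(script):
        continue
    # Keep everything after the assignment.
    jsonStr = str(script).split('var tableData1 = ', 1)[-1]

    # Trim trailing JavaScript at the last ';' until the rest parses as JSON.
    while True:
        try:
            jsonData = json.loads(jsonStr)
            break
        except json.JSONDecodeError:
            if ';' not in jsonStr:
                break  # nothing left to trim, so stop instead of looping forever
            jsonStr = jsonStr.rsplit(';', 1)[0]
    break

if jsonData is None:
    raise ValueError('tableData1 could not be extracted from the page')

df = pd.DataFrame(jsonData)
df.columns = ['Conditions', 'Studies']
# Each condition is an HTML fragment; keep only its text.
df['Conditions'] = [BeautifulSoup(x, 'html.parser').text for x in df['Conditions']]
print(df)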

Output:

Conditions Studies
0                                ACTH Syndrome, Ectopic       8
1                      ACTH-Secreting Pituitary Adenoma      62
2     ACTH-independent Macronodular Adrenal Hyperplasia       2
3                              ADCY5-related Dyskinesia       2
4                                         ADNP Syndrome       2
                                                ...     ...
5653                46, XX Disorders of Sex Development      58
5654                                    47 XXX Syndrome       2
5655                                   47, XYY Syndrome       3
5656                            5-Nucleotidase Syndrome       1
5657                                       5q- Syndrome       1

[5658 rows x 2 columns]
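
As a possible alternative to trimming the string at each ';', the sketch below uses json.JSONDecoder.raw_decode, which parses one JSON value from the start of a string and ignores whatever follows it. It assumes the JavaScript literal is also valid JSON. The extract_js_array helper and the sample_page markup are made up for illustration and are not taken from the real page.

```python
import json


def extract_js_array(page_source, marker="var tableData1 = "):
    """Return the JSON value that follows `marker` in `page_source`."""
    start = page_source.find(marker)
    if start == -1:
        raise ValueError(f"{marker!r} not found in page source")
    # raw_decode does not skip leading whitespace, so strip it first.
    remainder = page_source[start + len(marker):].lstrip()
    # raw_decode returns the parsed value and the index where parsing stopped.
    value, _end = json.JSONDecoder().raw_decode(remainder)
    return value


# Hypothetical stand-in for the downloaded page (assumption, not real data).
sample_page = """
<script>
var tableData1 = [["<a href='/x'>ACTH Syndrome, Ectopic</a>", "8"],
                  ["<a href='/y'>5q- Syndrome</a>", "1"]];
var other = 1; // anything after the array is ignored
</script>
"""

rows = extract_js_array(sample_page)
print(len(rows), rows[0][1])
```

With the real page you would pass response.text instead of sample_page and then build the DataFrame from the returned list as above.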

Here is working code:

import requests
from bs4 import BeautifulSoup

url = "https://clinicaltrials.gov/ct2/search/browse?brwse=cond_alpha_all"

r = requests.get(url) # Fetch the page
soup = BeautifulSoup(r.content, "html.parser") # Parse the page in HTML
table = soup.find("table", { "id": "theDataTable", "class": ["display", "dataTable", "no-footer"]})

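What is not working in your code is the soup.find(...) statement, where you wrote "class":"display dataTable no-footer" instead of "class": ["display", "dataTable", "no-footer"].

When the class filter is a single string containing several classes, BeautifulSoup compares it with the exact value of the tag's class attribute, so the tag is only found if its classes are written exactly that way. When the filter is a list, a tag matches as soon as it has any one of the listed classes.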
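
To see how the two forms of the class filter behave, here is a small sketch on made-up markup. It assumes, hypothetically, that the raw HTML carries only the display class and that the other classes are added later in the browser. The html string is invented for illustration and is not copied from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (assumption): the table with only one of the classes.
html = '<table id="theDataTable" class="display"><tr><td>x</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# A string with several classes only matches that exact class attribute value.
exact = soup.find("table", {"id": "theDataTable",
                            "class": "display dataTable no-footer"})

# A list matches when the tag has at least one of the listed classes.
any_of = soup.find("table", {"id": "theDataTable",
                             "class": ["display", "dataTable", "no-footer"]})

# A CSS selector requires every listed class to be present.
all_of = soup.select_one("table#theDataTable.display.dataTable")

print(exact is None, any_of is not None, all_of is None)
```

If the goal is simply to locate the table, searching by id alone, as in soup.find("table", id="theDataTable"), avoids depending on the class list at all.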

You will notice that I also added "html.parser" as the second argument of the BeautifulSoup(...) constructor. This is not mandatory, but it avoids the GuessedAtParserWarning: No parser was explicitly specified,[...] warning that BeautifulSoup emits otherwise.

