
Unable to scrape data from a local html file using Beautifulsoup

I tried using BeautifulSoup to scrape rows of table data from a locally available html file (download link provided below), without success:

Here's my effort:

from bs4 import BeautifulSoup
import json


with open("web_summary.html", "r") as file:
    html_file = file.read()

soup = BeautifulSoup(html_file, "html.parser")

script = soup.find("div", {"data-component": "CellRangerSummary", "data-key": "summary"}).find('script')
table_data = json.loads(script.text.split('=')[1], encoding='utf-8')
summary_data = table_data['summary']
summary_tab = summary_data['summary_tab']

rows = summary_tab['table']['rows']

for row in rows:
    print(row[0],row[1])

html file download link

Here's the expected output (rows of all tables) as a dataframe:

Number of Spots Under Tissue    2,987
Mean Reads per Spot 128,583
Median Genes per Spot   4,553
Number of Reads 384,076,450
Valid Barcodes  97.70%
Valid UMIs  99.90%
Sequencing Saturation   80.20%
Q30 Bases in Barcode    98.90%
Q30 Bases in RNA Read   89.60%
Q30 Bases in UMI    98.80%
Reads Mapped to Genome  86.00%
Reads Mapped Confidently to Genome  79.10%
Reads Mapped Confidently to Intergenic Regions  5.20%
Reads Mapped Confidently to Intronic Regions    0.00%
Reads Mapped Confidently to Exonic Regions  73.90%
Reads Mapped Confidently to Transcriptome   65.60%
Reads Mapped Antisense to Gene  1.40%
Fraction Reads in Spots Under Tissue    97.30%
Mean Reads per Spot 128,583
Median Genes per Spot   4,553
Total Genes Detected    21,673
Median UMI Counts per Spot  14,169

Any ideas (Beautifulsoup or any other framework) to make my code work?

The data you are after does not sit in a single table; it is spread across several tables inside the JSON embedded in the script tag. The script below collects the rows from all of those tables, staying as close as possible to the approach you started with:

from bs4 import BeautifulSoup
import requests
import json

link = 'https://cell2location.cog.sanger.ac.uk/tutorial/mouse_brain_visium_data/rawdata/ST8059048/web_summary.html'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
}
res = requests.get(link, headers=headers)
res.raise_for_status()  # stop early if the download failed
soup = BeautifulSoup(res.text, "html.parser")

# The summary is embedded as JSON in a script tag: "const data = {...}"
script = soup.find("div", {"data-component": "CellRangerSummary", "data-key": "summary"}).find('script')
table_data = json.loads(script.contents[0].strip().split('const data = ')[1])
summary_data = table_data['summary']

# summary_tab holds several sections; only some of them contain a table
for item, val in summary_data['summary_tab'].items():
    if not val.get('table'):
        continue
    rows = val['table']['rows']

    for row in rows:
        print(row[0], row[1])

Output:

Number of Reads 384,076,450
Valid Barcodes 97.7%
Valid UMIs 99.9%
Sequencing Saturation 80.2%
Q30 Bases in Barcode 98.9%
Q30 Bases in RNA Read 89.6%
Q30 Bases in UMI 98.8%
Fraction Reads in Spots Under Tissue 97.3%
Mean Reads per Spot 128,583
Median Genes per Spot 4,553
Total Genes Detected 21,673
Median UMI Counts per Spot 14,169
Reads Mapped to Genome 86.0%
Reads Mapped Confidently to Genome 79.1%
Reads Mapped Confidently to Intergenic Regions 5.2%
Reads Mapped Confidently to Intronic Regions 0.0%
Reads Mapped Confidently to Exonic Regions 73.9%
Reads Mapped Confidently to Transcriptome 65.6%
Reads Mapped Antisense to Gene 1.4%
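
If splitting on 'const data = ' ever proves fragile (for example, when something follows the JSON object in the script), json.JSONDecoder().raw_decode can parse one JSON value starting at a given position and ignore whatever comes after it. The following is only a minimal sketch: sample_script is a made-up stand-in for the real script contents, its section names ("alarms", "sequencing", "mapping") are assumptions, and extract_json_after and collect_rows are hypothetical helper names.

```python
import json


def extract_json_after(script_text, marker="const data = "):
    """Parse the JSON value that starts right after `marker`."""
    start = script_text.index(marker) + len(marker)
    # raw_decode stops at the end of the first JSON value,
    # so a trailing ';' or further JavaScript is ignored.
    obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    return obj


def collect_rows(summary_tab):
    """Gather (name, value) pairs from every section that has a table."""
    rows = []
    for section in summary_tab.values():
        if not isinstance(section, dict):
            continue
        table = section.get("table")
        if not table:
            continue
        for row in table.get("rows", []):
            rows.append((row[0], row[1]))
    return rows


# Made-up sample that mimics the assumed layout of the embedded data.
sample_script = '''
const data = {"summary": {"summary_tab": {
    "alarms": {"alarms": []},
    "sequencing": {"table": {"rows": [["Number of Reads", "384,076,450"],
                                      ["Valid Barcodes", "97.7%"]]}},
    "mapping": {"table": {"rows": [["Reads Mapped to Genome", "86.0%"]]}}
}}};
'''

data = extract_json_after(sample_script)
rows = collect_rows(data["summary"]["summary_tab"])
for name, value in rows:
    print(name, value)
```

With the real file, script.contents[0] would take the place of sample_script.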

pandas has a read_html function that works for your case (it needs an HTML parser such as lxml installed):

import pandas as pd

# read_html returns one DataFrame per table (sequencing, mapping, spots, sample);
# concatenate them and renumber the rows
df = pd.concat(pd.read_html('web_summary.html'), ignore_index=True)
df.columns = ['field', 'value']
print(df)
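
One detail about pd.concat worth illustrating: by default each table keeps its own row index, so the combined frame has repeated index values unless ignore_index=True is passed. A small sketch with two hand-made frames standing in for the tables that read_html would return (the frame names and their contents are illustrative assumptions, not read from the file):

```python
import pandas as pd

# Hypothetical stand-ins for two of the tables parsed from the page.
sequencing = pd.DataFrame([["Number of Reads", "384,076,450"],
                           ["Valid Barcodes", "97.7%"]])
mapping = pd.DataFrame([["Reads Mapped to Genome", "86.0%"]])

# ignore_index=True renumbers the rows 0..n-1 across all tables.
df = pd.concat([sequencing, mapping], ignore_index=True)
df.columns = ["field", "value"]
print(df)
```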
