I tried using BeautifulSoup to scrape the rows of table data from a locally available HTML file (download link provided below), without any success.
Here's my effort:
from bs4 import BeautifulSoup
import json
with open("web_summary.html", "r") as file:
    html_file = file.read()
soup = BeautifulSoup(html_file, "html.parser")
script = soup.find("div", {"data-component": "CellRangerSummary", "data-key": "summary"}).find('script')
table_data = json.loads(script.text.split('=')[1], encoding='utf-8')
summary_data = table_data['summary']
summary_tab = summary_data['summary_tab']
rows = summary_tab['table']['rows']
for row in rows:
    print(row[0], row[1])
Here's the expected output (rows of all tables) as a dataframe:
Number of Spots Under Tissue 2,987
Mean Reads per Spot 128,583
Median Genes per Spot 4,553
Number of Reads 384,076,450
Valid Barcodes 97.70%
Valid UMIs 99.90%
Sequencing Saturation 80.20%
Q30 Bases in Barcode 98.90%
Q30 Bases in RNA Read 89.60%
Q30 Bases in UMI 98.80%
Reads Mapped to Genome 86.00%
Reads Mapped Confidently to Genome 79.10%
Reads Mapped Confidently to Intergenic Regions 5.20%
Reads Mapped Confidently to Intronic Regions 0.00%
Reads Mapped Confidently to Exonic Regions 73.90%
Reads Mapped Confidently to Transcriptome 65.60%
Reads Mapped Antisense to Gene 1.40%
Fraction Reads in Spots Under Tissue 97.30%
Mean Reads per Spot 128,583
Median Genes per Spot 4,553
Total Genes Detected 21,673
Median UMI Counts per Spot 14,169
Any ideas (BeautifulSoup or any other framework) on how to make my code work?
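The values you are looking for do not sit in a single table. They are spread across several tables inside the JSON embedded in the script tag, each stored under its own key of summary_tab rather than directly under summary_tab['table']. Note also that json.loads no longer accepts an encoding argument as of Python 3.9. The script below loops over every entry of summary_tab and prints the rows of each table it finds, staying as close as possible to the way you started: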
from bs4 import BeautifulSoup
import requests
import json
link = 'https://cell2location.cog.sanger.ac.uk/tutorial/mouse_brain_visium_data/rawdata/ST8059048/web_summary.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
}
res = requests.get(link, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")
# The summary is embedded as JSON inside a <script> tag under this div
script = soup.find("div", {"data-component": "CellRangerSummary", "data-key": "summary"}).find('script')
# Split on the full assignment prefix rather than on a bare '='
table_data = json.loads(script.contents[0].strip().split('const data = ')[1])
summary_data = table_data['summary']
# Each entry of summary_tab may carry its own table; skip those that do not
for val in summary_data['summary_tab'].values():
    if not val.get('table'):
        continue
    rows = val['table']['rows']
    for row in rows:
        print(row[0], row[1])
Output:
Number of Reads 384,076,450
Valid Barcodes 97.7%
Valid UMIs 99.9%
Sequencing Saturation 80.2%
Q30 Bases in Barcode 98.9%
Q30 Bases in RNA Read 89.6%
Q30 Bases in UMI 98.8%
Fraction Reads in Spots Under Tissue 97.3%
Mean Reads per Spot 128,583
Median Genes per Spot 4,553
Total Genes Detected 21,673
Median UMI Counts per Spot 14,169
Reads Mapped to Genome 86.0%
Reads Mapped Confidently to Genome 79.1%
Reads Mapped Confidently to Intergenic Regions 5.2%
Reads Mapped Confidently to Intronic Regions 0.0%
Reads Mapped Confidently to Exonic Regions 73.9%
Reads Mapped Confidently to Transcriptome 65.6%
Reads Mapped Antisense to Gene 1.4%
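Since the question asks for a dataframe and reads from a local file, the same idea can be split into two small helpers that work on the script text however it was obtained. This is only a sketch under stated assumptions: the function names extract_json and collect_rows are made up for illustration, the 'const data = ' prefix is taken from the script above, and the sample string is a hypothetical miniature of the real page, not its actual content.

```python
import json

MARKER = "const data = "  # assumed prefix, taken from the script above


def extract_json(script_text, marker=MARKER):
    """Return the JSON value that follows `marker` in the script text."""
    _, found, tail = script_text.partition(marker)
    if not found:
        raise ValueError(f"marker {marker!r} not found")
    # raw_decode parses one JSON value and ignores whatever follows it,
    # so a trailing ';' or further JavaScript does not break the parse
    obj, _ = json.JSONDecoder().raw_decode(tail.lstrip())
    return obj


def collect_rows(summary_tab):
    """Gather (label, value) pairs from every section that carries a table."""
    rows = []
    for section in summary_tab.values():
        if not isinstance(section, dict):
            continue
        table = section.get("table")
        if not table:
            continue
        rows.extend((row[0], row[1]) for row in table.get("rows", []))
    return rows


# Hypothetical miniature of the script content, for illustration only
sample = (
    'const data = {"summary": {"summary_tab": {'
    '"a": {"table": {"rows": [["Number of Reads", "384,076,450"]]}}, '
    '"b": {"help": "x=y"}, '
    '"c": {"table": {"rows": [["Valid Barcodes", "97.7%"]]}}}}};\n'
    "render(data);"
)
rows = collect_rows(extract_json(sample)["summary"]["summary_tab"])
for label, value in rows:
    print(label, value)
```

If pandas is installed, the collected pairs could then be turned into a dataframe with something like pd.DataFrame(rows, columns=["Metric", "Value"]); the column names here are arbitrary. For the local file, script.contents[0] from the BeautifulSoup lookup above would be passed to extract_json in place of the sample string.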