
Python / BeautifulSoup - Scraping XML data from Clinicaltrials.gov API - parse data within XML parent/child tags


I'm new to working with XML and BeautifulSoup, and I am trying to get a dataset of clinical trials using Clinicaltrials.gov's new API, which converts a list of trials into an XML dataset. I tried using find_all() as I typically do with HTML, but I'm not having the same luck. I've tried a few other approaches, like converting to a string and splitting (very messy), but I don't want to clutter my code with failed attempts.

Bottom line: I want to extract all NCTIds and official titles for each clinical trial listed in the XML file. (I know I can just convert the whole thing into a string and use regex, but I want to learn how to actually parse XML correctly.) Any help is appreciated!

import requests
from bs4 import BeautifulSoup
from lxml import etree
import lxml.html

url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
m1_nctid = soup.find_all('Field Name="NCTId"') #This comes back with 0 results
m1_officialtitle = soup.find_all('Field Name="OfficialTitle"') #This comes back with 0 results

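BeautifulSoup's HTML parsers, including lxml, convert tag and attribute names to lowercase, and find_all() expects a tag name rather than a tag-plus-attribute string, which is why the original calls return nothing. Search for the field tag in lowercase and pass name as an attribute through attrs. This works with BeautifulSoup alone; there's no need to use etree: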

import requests
from bs4 import BeautifulSoup


url = "https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")

m1_nctid = soup.find_all("field", attrs={"name": "NCTId"})
m1_officialtitle = soup.find_all("field", attrs={"name": "OfficialTitle"})
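
To see the case-folding in isolation, here is a minimal sketch that runs without hitting the API. The XML string is a made-up sample that only imitates the Field Name="..." pattern from the question, and the NCT number and title in it are invented. It uses Python's built-in "html.parser" so it runs without lxml installed; that parser lowercases names the same way the "lxml" parser does.

```python
from bs4 import BeautifulSoup

# Hypothetical sample, NOT real API output: it only mimics the
# <Field Name="..."> pattern shown in the question.
SAMPLE_XML = (
    '<FullStudy Rank="1">'
    '<Field Name="NCTId">NCT00000001</Field>'
    '<Field Name="OfficialTitle">A Hypothetical Trial</Field>'
    '</FullStudy>'
)

# An HTML parser lowercases tag and attribute names while building the tree.
soup = BeautifulSoup(SAMPLE_XML, "html.parser")

# Searching with the capitalisation used in the XML source finds nothing...
original_case = soup.find_all("Field", attrs={"Name": "NCTId"})

# ...while the lowercase names match. Attribute *values* keep their case.
lower_case = soup.find_all("field", attrs={"name": "NCTId"})
nct_ids = [tag.get_text(strip=True) for tag in lower_case]

print(len(original_case), nct_ids)
```

If you would rather keep the original capitalisation, passing "xml" as the parser name (which requires lxml) makes BeautifulSoup treat the document as XML, so you would search for "Field" and "Name" instead.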

You can filter on attributes like this:

# find_all is the current name; findAll is the older camelCase alias
m1_nctid = soup.find_all("field", {"name": "NCTId"})
m1_officialtitle = soup.find_all("field", {"name": "OfficialTitle"})

Then iterate over the results to get the text of each one, for example:

official_titles = [result.text for result in m1_officialtitle]

For more information, see the BeautifulSoup documentation.
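
Two separate lists can drift out of step if a study has no official title. A rough sketch of pairing the two fields per study follows. It swaps in the standard library's xml.etree.ElementTree, which is not what the answers above use, because it is a true XML parser and keeps the original capitalisation. The sample XML is invented, and the FullStudy wrapper element is an assumption about how the response groups each trial, so check it against the real output before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, NOT real API output. The <FullStudy> wrapper is an
# assumption about the response layout; the IDs and titles are invented.
SAMPLE_XML = """\
<FullStudiesResponse>
  <FullStudyList>
    <FullStudy Rank="1">
      <Struct Name="Study">
        <Field Name="NCTId">NCT00000001</Field>
        <Field Name="OfficialTitle">First Hypothetical Trial</Field>
      </Struct>
    </FullStudy>
    <FullStudy Rank="2">
      <Struct Name="Study">
        <Field Name="NCTId">NCT00000002</Field>
      </Struct>
    </FullStudy>
  </FullStudyList>
</FullStudiesResponse>
"""


def extract_studies(xml_text):
    """Return one (nct_id, official_title) tuple per study; missing -> None."""
    root = ET.fromstring(xml_text)
    studies = []
    for study in root.iter("FullStudy"):
        # Look up each field inside this study only, at any nesting depth.
        nct_id = study.find(".//Field[@Name='NCTId']")
        title = study.find(".//Field[@Name='OfficialTitle']")
        studies.append((
            nct_id.text if nct_id is not None else None,
            title.text if title is not None else None,
        ))
    return studies


studies = extract_studies(SAMPLE_XML)
for nct_id, title in studies:
    print(nct_id, title)
```

With a live response you would pass response.content to extract_studies() instead of the sample string.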

