Python / BeautifulSoup - Scraping XML data from Clinicaltrials.gov API - parse data within XML parent/child tags
I'm new to working with XML and BeautifulSoup, and I am trying to get a dataset of clinical trials using Clinicaltrials.gov's new API, which converts a list of trials into an XML dataset. I tried using find_all() like I typically do with HTML, but I'm not having the same luck. I've tried a few other approaches, like converting to a string and splitting (very messy), but I don't want to clutter my code with failed attempts.
Bottom line: I want to extract all NCTIds (I know I can just convert the whole thing into a string and use regex, but I want to learn how to actually parse XML correctly) and official titles for each clinical trial listed in the XML file. Any help is appreciated!
import requests
from bs4 import BeautifulSoup
from lxml import etree
import lxml.html
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
m1_nctid = soup.find_all('Field Name="NCTId"') #This comes back with 0 results
m1_officialtitle = soup.find_all('Field Name="OfficialTitle"') #This comes back with 0 results
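You can search for the field tag in lowercase and pass name as an attribute to attrs. The "lxml" parser treats the document as HTML and lowercases tag and attribute names, and find_all() takes the tag name and the attribute filter as separate arguments, which is why 'Field Name="NCTId"' matches nothing. This works with just BeautifulSoup; there's no need to use etree: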
import requests
from bs4 import BeautifulSoup
url = "https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
m1_nctid = soup.find_all("field", attrs={"name": "NCTId"})
m1_officialtitle = soup.find_all("field", attrs={"name": "OfficialTitle"})
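To see the lowercasing in isolation, here is a minimal sketch that runs without a network call. The XML snippet is a simplified, hypothetical stand-in for the API response (the NCT ID and title are made up), and it uses the built-in "html.parser" instead of "lxml" only to avoid the extra dependency; both are HTML parsers and both lowercase tag and attribute names.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified sample; the real response is larger and nested deeper.
SAMPLE_XML = """<FullStudy Rank="1">
  <Struct Name="IdentificationModule">
    <Field Name="NCTId">NCT00000001</Field>
    <Field Name="OfficialTitle">Hypothetical Trial One</Field>
  </Struct>
</FullStudy>"""

# An HTML parser lowercases tag and attribute names while building the tree.
soup = BeautifulSoup(SAMPLE_XML, "html.parser")

# Searching with the casing used in the XML source finds nothing...
original_case = len(soup.find_all("Field", attrs={"Name": "NCTId"}))
# ...while the lowercase names match. Attribute values keep their casing.
lower_case = len(soup.find_all("field", attrs={"name": "NCTId"}))

print(original_case)  # 0
print(lower_case)     # 1
```

If you would rather keep the original casing, BeautifulSoup also accepts "xml" as the parser name (it requires lxml), in which case the tags stay as Field and the attribute as Name.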
You can filter on attributes like this:
m1_nctid = soup.findAll("field", {"name" : "NCTId"})
m1_officialtitle = soup.findAll("field", {"name" : "OfficialTitle"})
and then iterate over each result to get the text, for example:
official_titles = [result.text for result in m1_officialtitle]
For more info, you can check the BeautifulSoup documentation.
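One thing to watch for: two separate find_all() calls give two independent lists, so the IDs and titles only line up if every trial has both fields. A possible way around that is to walk the document one study at a time. The sketch below swaps in the standard library's xml.etree.ElementTree (not BeautifulSoup), which keeps the original tag casing. The FullStudy wrapper element and the sample data are assumptions made for illustration, and extract_trials is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; the second study deliberately has no title.
SAMPLE_XML = """<FullStudiesResponse>
  <FullStudyList>
    <FullStudy Rank="1">
      <Struct Name="Study">
        <Field Name="NCTId">NCT00000001</Field>
        <Field Name="OfficialTitle">Hypothetical Trial One</Field>
      </Struct>
    </FullStudy>
    <FullStudy Rank="2">
      <Struct Name="Study">
        <Field Name="NCTId">NCT00000002</Field>
      </Struct>
    </FullStudy>
  </FullStudyList>
</FullStudiesResponse>"""


def extract_trials(xml_text):
    """Return a list of (nct_id, official_title) tuples, one per study."""
    root = ET.fromstring(xml_text)
    trials = []
    # Assumes each trial is wrapped in a <FullStudy> element.
    for study in root.iter("FullStudy"):
        nct = study.find(".//Field[@Name='NCTId']")
        title = study.find(".//Field[@Name='OfficialTitle']")
        trials.append((
            nct.text if nct is not None else None,
            title.text if title is not None else None,
        ))
    return trials


trials = extract_trials(SAMPLE_XML)
for nct_id, official_title in trials:
    print(nct_id, official_title)
```

With the real response you would pass response.content to extract_trials() instead of SAMPLE_XML, after checking that the element names match what the API actually returns.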