Python / BeautifulSoup - Scraping XML data from Clinicaltrials.gov API - parse data within XML parent/child tags

Question

I'm new to working with XML and BeautifulSoup, and I am trying to get a dataset of clinical trials using Clinicaltrials.gov's new API, which converts a list of trials into an XML dataset. I tried using find_all() like I typically do with HTML, but I'm not having the same luck. I've tried a few other approaches, like converting to a string and splitting (very messy), but I don't want to clutter my code with failed attempts.

Bottom line: I want to extract all NCTIds (I know I can just convert the whole thing into a string and use regex, but I want to learn how to actually parse XML correctly) and official titles for each clinical trial listed in the XML file. Any help is appreciated!

import requests
from bs4 import BeautifulSoup
from lxml import etree
import lxml.html

url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
m1_nctid = soup.find_all('Field Name="NCTId"') #This comes back with 0 results
m1_officialtitle = soup.find_all('Field Name="OfficialTitle"') #This comes back with 0 results

Answer 1

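You can search for the field tag in lowercase and pass name as an attribute to attrs. The "lxml" parser used here is an HTML parser, so it lowercases tag and attribute names (Field becomes field, Name becomes name), while attribute values such as NCTId keep their case. This works with just BeautifulSoup; there's no need to use etree: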

import requests
from bs4 import BeautifulSoup


url = "https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")

m1_nctid = soup.find_all("field", attrs={"name": "NCTId"})
m1_officialtitle = soup.find_all("field", attrs={"name": "OfficialTitle"})
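The two find_all() calls above return flat lists, so an ID and a title are only matched up by position. A minimal sketch of pairing them per trial follows. It assumes each trial is wrapped in a FullStudy element (not shown in the original post), uses a small made-up sample instead of the live API, and uses the built-in "html.parser" so it runs without lxml; that parser also lowercases tag and attribute names. The helper name extract_trials is hypothetical.

```python
from bs4 import BeautifulSoup

# Simplified, made-up sample that mimics the assumed response layout.
# The IDs and titles are placeholders, not real trials.
SAMPLE_XML = """
<FullStudiesResponse>
  <FullStudyList>
    <FullStudy Rank="1">
      <Struct Name="Study">
        <Field Name="NCTId">NCT00000001</Field>
        <Field Name="OfficialTitle">Example Trial One</Field>
      </Struct>
    </FullStudy>
    <FullStudy Rank="2">
      <Struct Name="Study">
        <Field Name="NCTId">NCT00000002</Field>
      </Struct>
    </FullStudy>
  </FullStudyList>
</FullStudiesResponse>
"""


def extract_trials(xml_text):
    """Return one dict per trial, with None for any missing field."""
    # An HTML parser lowercases tag and attribute names, so search in lowercase.
    soup = BeautifulSoup(xml_text, "html.parser")
    trials = []
    for study in soup.find_all("fullstudy"):
        # "name" must go through attrs because find_all/find already use
        # the keyword "name" for the tag name itself.
        nct = study.find("field", attrs={"name": "NCTId"})
        title = study.find("field", attrs={"name": "OfficialTitle"})
        trials.append({
            "nct_id": nct.get_text(strip=True) if nct else None,
            "official_title": title.get_text(strip=True) if title else None,
        })
    return trials


trials = extract_trials(SAMPLE_XML)
for trial in trials:
    print(trial)
```

Searching inside each study keeps a trial without an official title from shifting every later title onto the wrong ID.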

Answer 2

You can filter on attributes as follows:

m1_nctid = soup.find_all("field", {"name": "NCTId"})
m1_officialtitle = soup.find_all("field", {"name": "OfficialTitle"})

and then iterate over each result to get its text, for example:

official_titles = [result.text for result in m1_officialtitle]

For more info, you can check the documentation.
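Since the question asks about parsing the response as XML rather than as HTML, here is a hedged alternative sketch using the standard library's xml.etree.ElementTree, which is case-sensitive, so the original Field and Name spellings are used. The sample data is made up, the FullStudy wrapper element is an assumption about the response layout, and extract_trials_etree is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up sample; IDs and titles are placeholders.
SAMPLE_XML = """
<FullStudiesResponse>
  <FullStudyList>
    <FullStudy Rank="1">
      <Struct Name="Study">
        <Field Name="NCTId">NCT00000001</Field>
        <Field Name="OfficialTitle">Example Trial One</Field>
      </Struct>
    </FullStudy>
    <FullStudy Rank="2">
      <Struct Name="Study">
        <Field Name="NCTId">NCT00000002</Field>
      </Struct>
    </FullStudy>
  </FullStudyList>
</FullStudiesResponse>
"""


def extract_trials_etree(xml_text):
    """Return (nct_id, official_title) tuples, one per FullStudy element."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for study in root.iter("FullStudy"):
        # ".//Field[@Name='...']" matches any descendant Field element
        # whose Name attribute equals the given value.
        nct_id = study.findtext(".//Field[@Name='NCTId']")
        title = study.findtext(".//Field[@Name='OfficialTitle']")
        rows.append((nct_id, title))
    return rows


rows = extract_trials_etree(SAMPLE_XML)
for nct_id, title in rows:
    print(nct_id, title)
```

With a live response, response.content could be passed to ET.fromstring in place of the sample string, assuming the request succeeds and returns well-formed XML.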
