Web scraping with Python ('NoneType' object has no attribute 'get_text')
I would like to extract drug information from multiple pages, such as https://www.medindia.net/doctors/drug_information/abacavir.htm and https://www.medindia.net/doctors/drug_information/talimogene_laherparepvec.htm.
On each page, the information I would like to extract is: General, Brands, Prescription, Contraindications, Side effects, Dosage, How to Take, Warning and Storage.
Using Beautiful Soup, I am able to identify the class needed for extraction. However, when I try to extract the information and store it in a variable, I get the error
'NoneType' object has no attribute 'get_text'
It seems that there is no element with the class 'drug-content'. However, when I print the items, the class is there. Please help me. Below is my code:
import pandas as pd
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://www.medindia.net/doctors/drug_information/abacavir.htm'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
drug = soup.find(class_='mi-container__fluid')
print(drug)
# whole page contain drug content
items = drug.find_all(class_='drug-content')
print(items)
# extract drug information from drug content into individual variable
general = items[0].find(class_='drug-content').get_text(strip=True).replace("\n", "")
brand = items[1].find(class_='report-content').get_text(strip=True).replace("\n", "")
prescription = items[1].find(class_='drug-content').get_text(strip=True).replace("\n", "")
contraindications = items[2].find(class_='drug-content').get_text(strip=True).replace("\n", "")
side_effect = items[2].find(class_='drug-content').get_text(strip=True).replace("\n", "")
dosage = items[3].find(class_='drug-content').get_text(strip=True).replace("\n", "")
how_to_use = items[4].find(class_='drug-content').get_text(strip=True).replace("\n", "")
warnings = items[5].find(class_='drug-content').get_text(strip=True).replace("\n", "")
storage = items[7].find(class_='drug-content').get_text(strip=True).replace("\n", "")
I have tried changing the class to 'report-content drug-widget'. However, with that class I am unable to extract the general information. Also, side-effect information is unavailable for this drug. How can I put NA into the variable if the information is not available for the drug?
# whole page contain drug content
items = drug.find_all(class_='report-content drug-widget')
print(items)
# extract drug information from drug content into individual variable
general = items.find(class_='drug-content').get_text(strip=True).replace("\n", "")
brand = items[0].find(class_='drug-content').get_text(strip=True).replace("\n", "")
Please advise how to extract the information, and how I can put NA where the information I need is not available.
I can help you with the first one; it should help you get started on how to deal with elements that are not found, and how to search for the pattern you're looking for:
try:
    general = items[0].find('h3', attrs={'style': 'margin:0px!important'}).get_text(strip=True).replace("\n", "").replace("\xa0", " ")
except (AttributeError, IndexError):
    # find() returned None, or the page has no matching items
    general = "N/A"
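A likely reason for the original error, judging from the question's code: `find_all(class_='drug-content')` already returns the `drug-content` elements, so calling `.find(class_='drug-content')` on each of them searches for a nested element of the same class, finds none, and returns `None`. The sketch below reproduces that on a small made-up HTML snippet (the markup is an assumption, not the real medindia.net page) and wraps the lookup in a hypothetical helper, `text_or_na`, that returns "N/A" when nothing is found.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page layout may differ.
SAMPLE_HTML = """
<div class="mi-container__fluid">
  <div class="report-content drug-widget">
    <div class="drug-content"><h3>Generic Name : Abacavir</h3></div>
  </div>
  <div class="report-content drug-widget">
    <div class="drug-content"><p>Store at room
    temperature.</p></div>
  </div>
</div>
"""


def text_or_na(element, default="N/A"):
    """Return the cleaned text of a tag, or `default` if the tag is None."""
    if element is None:
        return default
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join(element.get_text(" ", strip=True).split())


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
items = soup.find_all(class_="drug-content")

# Each item already *is* a drug-content element, so a nested search for the
# same class finds nothing and returns None.
nested = items[0].find(class_="drug-content")

general = text_or_na(items[0])  # text taken from the item itself
missing = text_or_na(nested)    # falls back to "N/A"
```

With a helper like this, each field can be assigned in one line without a separate try/except, provided the index into `items` is known to exist.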
You can slice the "Generic Name:" label off, since it is probably the same length on every page:
general = general[15:]
print(general)
# 'Abacavir'
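Slicing at a fixed offset depends on the label always being exactly 15 characters, and it would also cut into the "N/A" fallback. A slightly more forgiving variant, sketched here with a hypothetical helper `strip_label`, splits on the first colon instead and leaves "N/A" (or any text without a colon) untouched.

```python
def strip_label(text, na="N/A"):
    """Drop a leading 'Label :' prefix, e.g. 'Generic Name : Abacavir'."""
    if text == na:
        return text
    label, sep, value = text.partition(":")
    # If there is no colon, partition() leaves everything in `label`.
    return value.strip() if sep else text.strip()


print(strip_label("Generic Name : Abacavir"))
print(strip_label("N/A"))
```

Note that this splits on the first colon only, so a value that itself contains a colon is kept intact.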