Web scraping using Python BeautifulSoup
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"

# Download the page and parse the HTML
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

# Each review sits in a div with class "row review-article"
containers = page_soup.findAll("div", {"class": "row review-article"})
print(len(containers))
print(containers[0].a)  # first <a> tag inside the first review container
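I want the profile link (chintanverma in the given picture) as my output, but I am getting the link to the Reliance Jio service instead.
I would appreciate it if someone could help me correct the code to get the expected output, and explain why I am getting the Reliance Jio link.
My intention is to scrape the profile names from the page http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061.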
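In this case, you have to rely on a browser simulator to get the dynamically generated content. Selenium is one option. If Selenium is already installed on your machine, try the example below.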
from bs4 import BeautifulSoup
from selenium import webdriver

# Let a real browser load the page so its JavaScript runs
driver = webdriver.Chrome()
driver.get('http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061')
soup = BeautifulSoup(driver.page_source, "lxml")

for link in soup.select(".profile"):
    try:
        profile = link.select("p:nth-of-type(1) a")[0]
    except IndexError:
        # This block has no profile link; skip it instead of reprinting the previous one
        continue
    print(profile.text, profile['href'])

driver.quit()
Partial output:
chintanverma http://www.mouthshut.com/chintanverma
ganeshgauttam http://www.mouthshut.com/ganeshgauttam
viratvenkat1 http://www.mouthshut.com/viratvenkat1
ms37872 http://www.mouthshut.com/ms37872
bibekdas http://www.mouthshut.com/bibekdas
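Because the profile blocks are filled in by JavaScript after the page loads, `page_source` read immediately after `driver.get` may not contain them yet. The following is a minimal sketch of the same approach with an explicit wait, not a tested scraper for this site. The function names `extract_profiles` and `fetch_rendered_html`, the 10-second timeout, and the wait selector `.profile p a` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def extract_profiles(html):
    """Return (name, href) pairs for the first link in each .profile block."""
    soup = BeautifulSoup(html, "html.parser")
    profiles = []
    for block in soup.select(".profile"):
        link = block.select_one("p:nth-of-type(1) a")
        # Blocks that were never filled in by JavaScript have no link; skip them.
        if link is not None and link.has_attr("href"):
            profiles.append((link.get_text(strip=True), link["href"]))
    return profiles


def fetch_rendered_html(url, timeout=10):
    """Load url in Chrome and return the HTML once a profile link is present."""
    # Imported here so extract_profiles can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Assumed selector: wait until at least one profile link has been rendered.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".profile p a"))
        )
        return driver.page_source
    finally:
        # Close the browser even if the wait times out.
        driver.quit()
```

Calling `extract_profiles(fetch_rendered_html(my_url))` would then give a list of name and link pairs to print or store. Keeping the parsing separate from the browser step also makes it possible to check the selector against a saved copy of the page.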
The correct selector for the div with the user data is:
containers = page_soup.findAll("div", {"class": "profile"})
first_container = containers[0]
However, this DOM fragment is rendered by a call to the JavaScript function getuserprofile, so you cannot retrieve it with BeautifulSoup alone, because it returns:
<div class="col-2 profile" id="ctl00_ctl00_ContentPlaceHolderFooter_ContentPlaceHolderBody_rptreviews_ctl00_divProfile"><script>
getuserprofile(1318536,8393808,0,1,0,'','ctl00_ctl00_ContentPlaceHolderFooter_ContentPlaceHolderBody_rptreviews_ctl00_divProfile',3,'ctl00_ctl00_ContentPlaceHolderFooter_ContentPlaceHolderBody_rptreviews_ctl00_spnview','ctl00_ctl00_ContentPlaceHolderFooter_ContentPlaceHolderBody_rptreviews_ctl00_smdatetime')
</script></div>
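This also suggests why the original code printed the Reliance Jio link: `containers[0].a` returns the first `<a>` tag anywhere inside the container, and in the static HTML the profile div holds only a script, so the first anchor found belongs to something else. A minimal sketch of that behavior follows, using a hypothetical, heavily simplified fragment (the `col-10` wrapper, the shortened `id`, and the shortened `getuserprofile` call are assumptions, not the page's real markup).

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: the real page is larger and its exact
# structure is not shown in this thread.
STATIC_HTML = """
<div class="row review-article">
  <div class="col-10">
    <a href="/mobile-operators/Reliance-Jio-reviews-925812061">Reliance Jio</a>
  </div>
  <div class="col-2 profile" id="divProfile"><script>
    getuserprofile(1318536, 8393808)
  </script></div>
</div>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")
container = soup.find("div", {"class": "row review-article"})

# .a is shorthand for "first <a> descendant", wherever it sits in the container.
first_link = container.a

# Without a browser the profile div contains only the unexecuted script.
profile_div = container.find("div", {"class": "profile"})
profile_links = profile_div.find_all("a")

print(first_link["href"])
print(len(profile_links))
```

In this fragment the first print shows the service link and the second shows that the profile div has no anchors, which matches the behavior described in the question.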