How to print the first element in an HTML tag
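My code collects links/HTML from the different "sections" of a page.
It prints 2 links per section, but I only want the first one printed.
The expected output should not contain links ending in "video", which my code's output currently does.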
from selenium import webdriver
from bs4 import BeautifulSoup
import time
driver = webdriver.Chrome()
jam=[]
baseurl='https://meetinglibrary.asco.org'
driver.get('https://meetinglibrary.asco.org/results?meetingView=2020%20ASCO%20Virtual%20Scientific%20Program&page=1')
time.sleep(3)
page_source = driver.page_source
soup = BeautifulSoup(page_source,'html.parser')
productlist=soup.find_all('a',class_='ng-star-inserted')
for item in productlist:
    for link in item.find_all('a',href=True):
        jam.append(baseurl+link['href'])
print(jam)
Use os.path.basename to get the last segment of the string, and use the in operator to check whether "video" is present:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import os

driver = webdriver.Chrome()
jam = []
baseurl = 'https://meetinglibrary.asco.org'
driver.get('https://meetinglibrary.asco.org/results?meetingView=2020%20ASCO%20Virtual%20Scientific%20Program&page=1')
time.sleep(3)  # give the page time to render before reading its source
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
productlist = soup.find_all('a', class_='ng-star-inserted')
for item in productlist:
    for link in item.find_all('a', href=True):
        url = link['href']
        # os.path.basename returns the part after the last "/"
        if "video" not in os.path.basename(url):
            jam.append(baseurl + url)
print(jam)
Result:
['https://meetinglibrary.asco.org/record/185955/abstract',
'https://meetinglibrary.asco.org/record/185955/slide',
'https://meetinglibrary.asco.org/record/185954/abstract',
'https://meetinglibrary.asco.org/record/186048/abstract',
'https://meetinglibrary.asco.org/record/186048/slide',
'https://meetinglibrary.asco.org/record/190197/slide',
'https://meetinglibrary.asco.org/record/192623/slide',
'https://meetinglibrary.asco.org/record/185414/abstract',
'https://meetinglibrary.asco.org/record/185414/slide',
'https://meetinglibrary.asco.org/record/185415/abstract',
'https://meetinglibrary.asco.org/record/185415/slide',
'https://meetinglibrary.asco.org/record/185473/abstract',
'https://meetinglibrary.asco.org/record/185473/slide',
'https://meetinglibrary.asco.org/record/187584/slide',
'https://meetinglibrary.asco.org/record/188561/slide',
'https://meetinglibrary.asco.org/record/186710/abstract',
'https://meetinglibrary.asco.org/record/186710/slide',
'https://meetinglibrary.asco.org/record/186699/abstract',
'https://meetinglibrary.asco.org/record/186699/slide',
'https://meetinglibrary.asco.org/record/186698/abstract',
'https://meetinglibrary.asco.org/record/186698/slide',
'https://meetinglibrary.asco.org/record/187720/slide',
'https://meetinglibrary.asco.org/record/187480/abstract',
'https://meetinglibrary.asco.org/record/187480/slide',
'https://meetinglibrary.asco.org/record/191961/slide',
'https://meetinglibrary.asco.org/record/192626/slide',
'https://meetinglibrary.asco.org/record/186983/abstract',
'https://meetinglibrary.asco.org/record/186983/slide',
'https://meetinglibrary.asco.org/record/188580/abstract',
'https://meetinglibrary.asco.org/record/188580/slide',
'https://meetinglibrary.asco.org/record/189047/abstract',
'https://meetinglibrary.asco.org/record/189047/slide',
'https://meetinglibrary.asco.org/record/190223/slide',
'https://meetinglibrary.asco.org/record/190273/slide',
'https://meetinglibrary.asco.org/record/184812/abstract',
'https://meetinglibrary.asco.org/record/184812/slide',
'https://meetinglibrary.asco.org/record/184927/slide',
'https://meetinglibrary.asco.org/record/184805/abstract',
'https://meetinglibrary.asco.org/record/184805/slide',
'https://meetinglibrary.asco.org/record/184811/abstract',
'https://meetinglibrary.asco.org/record/184811/slide',
'https://meetinglibrary.asco.org/record/185576/slide',
'https://meetinglibrary.asco.org/record/190147/slide']
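The filtering step on its own can be tried without a browser. The following is only a minimal sketch: the function name drop_video_links and the sample hrefs are made up for illustration and are not taken from the page itself.

```python
import os

BASEURL = "https://meetinglibrary.asco.org"

def drop_video_links(hrefs, baseurl=BASEURL):
    """Return absolute URLs for hrefs whose last path segment does not contain 'video'."""
    return [baseurl + href for href in hrefs
            if "video" not in os.path.basename(href)]

# Hypothetical sample hrefs, shaped like the ones in the result above.
sample = ["/record/185955/abstract", "/record/185955/video", "/record/185955/slide"]
print(drop_video_links(sample))
```

Note that this still keeps both the abstract and the slide link for a record, which matches the result above.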
You can add a condition before appending each link to the list.
...
for item in productlist:
    ahrefs = item.find_all('a', href=True)
    for index, link in enumerate(ahrefs):
        # keep only even-indexed links (0, 2, ...) that do not contain "video"
        if (index % 2 == 0) and ('video' not in link['href']):
            jam.append(baseurl + link['href'])
print(jam)
...
Let me know how it goes after you try it. Good luck.
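If the goal is exactly one link per record, another option is to group by the record id instead of relying on the position of each link. This is only a sketch under an assumption: every href has the form /record/<id>/<kind>, as in the result shown earlier. The function name first_link_per_record and the sample hrefs are hypothetical.

```python
BASEURL = "https://meetinglibrary.asco.org"

def first_link_per_record(hrefs, baseurl=BASEURL):
    """Keep the first non-video link seen for each record id."""
    seen = set()
    result = []
    for href in hrefs:
        parts = href.strip("/").split("/")  # e.g. ["record", "185955", "abstract"]
        if len(parts) < 3 or "video" in parts[-1]:
            continue  # skip unexpected shapes and video links
        record_id = parts[1]
        if record_id not in seen:
            seen.add(record_id)
            result.append(baseurl + href)
    return result

# Hypothetical sample hrefs for illustration.
sample = ["/record/185955/abstract", "/record/185955/slide",
          "/record/185955/video", "/record/190197/slide"]
print(first_link_per_record(sample))
# ['https://meetinglibrary.asco.org/record/185955/abstract',
#  'https://meetinglibrary.asco.org/record/190197/slide']
```

Grouping by id does not depend on how many links each record has, so records that only have a slide link are still kept.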