How to extract data under a tag from a website using bs4
<html>
<head>
<title>Index of /pub/opera/desktop/</title>
</head>
<body>
<h1>Index of /pub/opera/desktop/</h1>
<hr>
<pre><a href="../">../</a>
<a href="15.0.1147.130/">15.0.1147.130/</a> 01-Jul-2013 15:18 -
<a href="15.0.1147.132/">15.0.1147.132/</a> 01-Jul-2013 15:18 -
<a href="15.0.1147.138/">15.0.1147.138/</a> 09-Jul-2013 12:11
I need to extract the version (15.0.1147.130) and the date (01-Jul-2013 15:18), but my code only gives me the version:
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://get.geo.opera.com/pub/opera/desktop/').text, 'html.parser')
for item in soup.find('pre').find_all('a')[1:]:
    print(item)
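What am I missing to get the date text?
You are selecting the `a` tags, and they do not contain the dates. The dates are plain text inside the `pre` element, after each link, so read the text of `pre` instead: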
soup = BeautifulSoup(requests.get('https://get.geo.opera.com/pub/opera/desktop/').text, 'html.parser')
for item in soup.find_all('pre'):
    version = item
    # getText() returns the link text and the dates that follow each link
    print(version.getText().replace('/', "").replace('-', ""))
Update
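import requests
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(requests.get('https://get.geo.opera.com/pub/opera/desktop/').text, 'html.parser')
# Strip '/' and '-' from the listing text (this also removes the hyphens inside each date), then split it into lines
lines = soup.find('pre').getText().replace('/', "").replace('-', "").split('\r')
# Skip the first piece (the "../" parent link) and the last piece
for line in lines[1:-1]:
    # Collapse runs of spaces so the fields can be split on a single space
    my_data = re.sub(' +', ' ', line).split(' ')
    geo = my_data[0]
    date = my_data[1]
    time = my_data[2]
    print(geo, date, time)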
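If you want to keep the date exactly as it appears (01-Jul-2013 rather than 01Jul2013), another option is to read the text node that follows each `a` tag through `next_sibling`. The following is only a sketch under a few assumptions: `SAMPLE_HTML` is the listing from the question with closing tags added so it runs without a network call, `parse_listing` is a hypothetical helper name, and the live page may be laid out differently.

```python
from bs4 import BeautifulSoup

# Assumption: sample listing copied from the question, with closing tags added.
SAMPLE_HTML = """<html>
<head><title>Index of /pub/opera/desktop/</title></head>
<body>
<h1>Index of /pub/opera/desktop/</h1>
<hr>
<pre><a href="../">../</a>
<a href="15.0.1147.130/">15.0.1147.130/</a> 01-Jul-2013 15:18 -
<a href="15.0.1147.132/">15.0.1147.132/</a> 01-Jul-2013 15:18 -
<a href="15.0.1147.138/">15.0.1147.138/</a> 09-Jul-2013 12:11
</pre>
</body>
</html>"""


def parse_listing(html):
    """Return (version, date, time) tuples from an index-style <pre> listing."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    # Skip the first link, which is the "../" parent directory.
    for link in soup.find('pre').find_all('a')[1:]:
        version = link.get_text().rstrip('/')
        # The date and time sit in the text node right after the <a> tag.
        trailing = link.next_sibling
        fields = str(trailing).split() if trailing is not None else []
        if len(fields) >= 2:
            rows.append((version, fields[0], fields[1]))
    return rows


for version, date, time in parse_listing(SAMPLE_HTML):
    print(version, date, time)
```

To run it against the real page, pass `requests.get('https://get.geo.opera.com/pub/opera/desktop/').text` to `parse_listing` instead of `SAMPLE_HTML`. Splitting on whitespace with `split()` avoids depending on the exact line endings or the number of padding spaces.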