
How to extract data under a tag from a website using bs4

<html>

<head>
  <title>Index of /pub/opera/desktop/</title>
</head>

<body>
  <h1>Index of /pub/opera/desktop/</h1>
  <hr>
  <pre><a href="../">../</a>
<a href="15.0.1147.130/">15.0.1147.130/</a>                                     01-Jul-2013 15:18                   -
<a href="15.0.1147.132/">15.0.1147.132/</a>                                     01-Jul-2013 15:18                   -
<a href="15.0.1147.138/">15.0.1147.138/</a>                                     09-Jul-2013 12:11

I need to extract the version, which is 15.0.1147.130, and the date, which is 01-Jul-2013 15:18. However, my code only gives me the version.

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://get.geo.opera.com/pub/opera/desktop/').text, 'html.parser')
for item in soup.find('pre').find_all('a')[1:]:
    print(item)

What am I missing to get the date text too?

You are selecting the <a> tags, and they do not contain the date. The date is plain text inside <pre>, after each link, so read the text of <pre> instead:

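soup = BeautifulSoup(requests.get('https://get.geo.opera.com/pub/opera/desktop/').text, 'html.parser')
for item in soup.find_all('pre'):
    # The dates live in the text of <pre>, not inside the <a> tags.
    # Note: replace('-', "") also removes the hyphens inside the dates.
    print(item.getText().replace('/', "").replace('-', ""))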
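
Another way to see why the <a> tags alone are not enough: in the listing, each date is a bare text node that follows its link inside <pre>. The following is a minimal sketch, assuming the markup looks like the snippet in the question. The SAMPLE_HTML string and the list_entries helper are made up for illustration, so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the question's listing, so no request is needed.
SAMPLE_HTML = """<pre><a href="../">../</a>
<a href="15.0.1147.130/">15.0.1147.130/</a>      01-Jul-2013 15:18      -
<a href="15.0.1147.132/">15.0.1147.132/</a>      01-Jul-2013 15:18      -
</pre>"""


def list_entries(html):
    """Return (version, date, time) tuples from an index listing (illustrative helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    entries = []
    for link in soup.find('pre').find_all('a')[1:]:  # skip the "../" link
        # The text node right after the link holds "date time size".
        trailing = link.next_sibling
        fields = str(trailing).split() if trailing else []
        if len(fields) >= 2:
            entries.append((link.get_text().rstrip('/'), fields[0], fields[1]))
    return entries


print(list_entries(SAMPLE_HTML))
```

Reading next_sibling keeps each version paired with its own date and leaves the hyphens in the date untouched. It does assume that every link is followed directly by its text, which may not hold for other listing formats.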

UPDATE

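import requests
from bs4 import BeautifulSoup


soup = BeautifulSoup(requests.get('https://get.geo.opera.com/pub/opera/desktop/').text, 'html.parser')
# The dates are plain text inside <pre>, so take the whole block and split it into lines.
lines = soup.find('pre').getText().splitlines()

for line in lines:
    my_data = line.split()  # split on any run of whitespace
    if len(my_data) < 3:
        continue  # skip the "../" entry and blank lines
    geo = my_data[0].rstrip('/')  # version without the trailing slash
    date = my_data[1]  # e.g. 01-Jul-2013 (hyphens kept)
    time = my_data[2]
    print(geo, date, time)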
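
If the date is needed as a real value rather than a string, the same lines can be matched with a regular expression and converted with datetime.strptime. This is only a sketch under assumptions: SAMPLE_TEXT, LINE_RE and parse_listing are hypothetical names, the line format is taken from the question, and %b expects English month abbreviations (the default C locale).

```python
import re
from datetime import datetime

# Hypothetical sample lines in the format shown in the question.
SAMPLE_TEXT = """../
15.0.1147.130/      01-Jul-2013 15:18      -
15.0.1147.138/      09-Jul-2013 12:11      -
"""

# A directory name ending in "/", whitespace, then "DD-Mon-YYYY HH:MM".
LINE_RE = re.compile(r'^(\S+?)/\s+(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2})')


def parse_listing(text):
    """Map each version to a datetime (illustrative helper, not part of bs4)."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:  # lines such as "../" carry no date and are skipped
            result[match.group(1)] = datetime.strptime(match.group(2), '%d-%b-%Y %H:%M')
    return result


for version, stamp in parse_listing(SAMPLE_TEXT).items():
    print(version, stamp.isoformat())
```

In practice the text passed to parse_listing would come from soup.find('pre').getText().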
