
How to scrape dates from a news site

I am trying to scrape a news website for articles published on a certain date. The function currently prints output like this:

<li class="meta-data"><time data-datetime="relative" datetime="2022-01-30T08:56:09Z" title="2022-01-30T08:56:09Z">January 30, 2022 08:56</time></li>

How can I print only the date and time? Printing i.text doesn't seem to work.

Below is the code.

import requests
from bs4 import BeautifulSoup
import datetime as datetime
from datetime import timedelta
import pandas as pd

pd.set_option('display.max_columns',None)
pd.set_option('max_colwidth',None)

def okx_scrap():
    b = []
    url = 'https://www.okex.com/support/hc/en-us/sections/360000030652-Latest-Announcements'
    page = requests.get(url)
    soup = BeautifulSoup(page.content,'html.parser')
    small_soup = soup.find_all(class_ = "article-list-link")
    url_1st = 'https://www.okex.com/support'

    #Getting Yesterday's Date

    for i in small_soup:
        full_url = url_1st +(i['href'])
        page2 = requests.get(full_url)
        soup2 = BeautifulSoup(page2.content,'html.parser')
        small_soup2 = soup2.find_all('li', {'class': 'meta-data'})
        #print(small_soup2)
        for i in small_soup2:
            print(i)

okx_scrap()

This assumes i is a string. If it is not, convert it first with the built-in str(), i.e. i = str(i):

i = str(i)
i = i.split("><")[1]
i = i.split("datetime=")[2]
i = i.split("\"")[1]

print(i)
# 2022-01-30T08:56:09Z


You can use a regex:

import re

string = '<li class="meta-data"><time data-datetime="relative" datetime="2022-01-30T08:56:09Z" title="2022-01-30T08:56:09Z">January 30, 2022 08:56</time></li>'

pattern = r"(\d{1,4}-\d{1,2}-\d{1,2}T\d{1,2}:\d{1,2}:\d{1,2}Z)"

output = re.findall(pattern, string)

print(output)
# ['2022-01-30T08:56:09Z', '2022-01-30T08:56:09Z']
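Since the goal in the question is to keep only news from a certain date, it may help to turn the matched string into a datetime object before comparing. The following is a minimal sketch using only the standard library; the names TIMESTAMP_RE, parse_timestamp and is_from_day are hypothetical helpers, not part of the original code, and the pattern is a stricter variant of the regex above that assumes the timestamp always has the form YYYY-MM-DDTHH:MM:SSZ.

```python
import re
from datetime import date, datetime, timedelta, timezone

# Assumption: timestamps always look like 2022-01-30T08:56:09Z (UTC, no fractions).
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def parse_timestamp(html):
    """Return the first timestamp found in html as an aware UTC datetime, or None."""
    match = TIMESTAMP_RE.search(html)
    if match is None:
        return None
    parsed = datetime.strptime(match.group(0), "%Y-%m-%dT%H:%M:%SZ")
    # The trailing Z means UTC, so attach that timezone explicitly.
    return parsed.replace(tzinfo=timezone.utc)


def is_from_day(html, target_day):
    """Return True if the timestamp in html falls on target_day (a date in UTC)."""
    published = parse_timestamp(html)
    return published is not None and published.date() == target_day


sample = (
    '<li class="meta-data"><time data-datetime="relative" '
    'datetime="2022-01-30T08:56:09Z" title="2022-01-30T08:56:09Z">'
    'January 30, 2022 08:56</time></li>'
)

print(parse_timestamp(sample).isoformat())    # 2022-01-30T08:56:09+00:00
print(is_from_day(sample, date(2022, 1, 30)))  # True

# Checking against yesterday's date (UTC), as the question intends:
yesterday = datetime.now(timezone.utc).date() - timedelta(days=1)
print(is_from_day(sample, yesterday))
```

Comparing date objects rather than raw strings avoids surprises if the page ever changes how the timestamp is formatted for display.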

Use find instead of find_all, because each page has only one entry, and extract the time element rather than the li element:

import requests
from bs4 import BeautifulSoup

def okx_scrap():
    b = []
    url = 'https://www.okex.com/support/hc/en-us/sections/360000030652-Latest-Announcements'
    page = requests.get(url)
    soup = BeautifulSoup(page.content,'html.parser')
    small_soup = soup.find_all(class_ = "article-list-link")
    url_1st = 'https://www.okex.com/support'

    #Getting Yesterday's Date

    for i in small_soup:
        full_url = url_1st +(i['href'])
        page2 = requests.get(full_url)
        soup2 = BeautifulSoup(page2.content,'html.parser')
        # Read the datetime attribute of the first <time> element on the page
        print(soup2.find('time')['datetime'])
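One thing to watch for: find returns None when a page has no time element, and subscripting None raises a TypeError. Below is a small offline sketch of the same extraction run against the HTML snippet from the question, with a guard for that case. The helper name extract_published is hypothetical and not part of the original answer.

```python
from bs4 import BeautifulSoup


def extract_published(html):
    """Return the datetime attribute of the first <time> element, or None."""
    soup = BeautifulSoup(html, "html.parser")
    time_tag = soup.find("time")
    if time_tag is None:
        # No <time> element on this page, so there is nothing to extract.
        return None
    # Tag.get returns None instead of raising if the attribute is missing.
    return time_tag.get("datetime")


sample = (
    '<li class="meta-data"><time data-datetime="relative" '
    'datetime="2022-01-30T08:56:09Z" title="2022-01-30T08:56:09Z">'
    'January 30, 2022 08:56</time></li>'
)

print(extract_published(sample))                      # 2022-01-30T08:56:09Z
print(extract_published("<p>no timestamp here</p>"))  # None
```

Inside the scraping loop, the same check would let you skip pages without a timestamp rather than stopping the whole run.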

