简体   繁体   中英

Scraping Date of News

I am trying to do scraping from https://finansial.bisnis.com/read/20210506/90/1391096/laba-bank-mega-tumbuh-dua-digit-kuartal-i-2021-ini-penopangnya . I am trying to scrape the date of news, here's my code:

news['tanggal'] = newsScrape['date']
dates = []
for x in news['tanggal']:
    x = listToString(x)
    x = x.strip()
    x = x.replace('\r', '').replace('\n', '').replace(' \xa0|\xa0', ',').replace('|', ', ')
    dates.append(x)
dates = listToString(dates)
dates = dates[0:20]
if len(dates) == 0:
    continue
news['tanggal'] = dt.datetime.strptime(dates, '%d %B %Y, %H:%M')

but I got this error:

ValueError: time data '06 Mei 2021, 11:32  ' does not match format '%d %B %Y, %H:%M'

My assumption is because Mei is in Indonesian language, meanwhile the format need May which is in English. How to change Mei to be May ? I have tried dates = dates.replace('Mei', 'May') but it doesnt work on me. When I tried it, I got error ValueError: unconverted data remains: The type of dates is string . Thanks

Your assumption regarding the May -> Mei change is correct, the reason you're likely facing a problem after the replacement are the trailing spaces in your string, which are not accounted for in your format. You can use string.rstrip() to remove these spaces.

import datetime as dt

dates = "06 Mei 2021, 11:32  "
dates = dates.replace("Mei", "May") # The replacement will have to be handled for all months, this is only an example
dates = dates.rstrip()
date = dt.datetime.strptime(dates, "%d %B %Y, %H:%M")
print(date) # 2021-05-06 11:32:00

While this does fix the problem here, it's messy to have to shorten the string like this after dates = dates[0:20] . Consider using regex to gain the appropriate format at once.

The problem seems to be just the trailing white space you have, which explains the error ValueError: unconverted data remains: . It is complaining that it is unable to convert the remaining data (whitespace).

s = '06 Mei 2021, 11:32  '.replace('Mei', 'May').strip()
datetime.strptime(s, '%d %B %Y, %H:%M')
# Returns datetime.datetime(2021, 5, 6, 11, 32)

Also, to convert all the Indonesian months to English, you can use a dictionary:

id_en_dict = {
    ...,
    'Mei': 'May',
    ...
}

You can try with the following

import datetime as dt
import requests
from bs4 import BeautifulSoup
import urllib.request

url="https://finansial.bisnis.com/read/20210506/90/1391096/laba-bank-mega-tumbuh-dua-digit-kuartal-i-2021-ini-penopangnya"
r = requests.get(url, verify=False)
soup = BeautifulSoup(r.content, 'html.parser')
info_soup= soup.find(class_="new-description")
x=info_soup.find('span').get_text(strip=True)
x = x.strip()
x = x.replace('\r', '').replace('\n', '').replace(' \xa0|\xa0', ',').replace('|', ', ')
x = x[0:20]
x = x.rstrip()
date= dt.datetime.strptime(x.replace('Mei', 'May'), '%d %B %Y, %H:%M')
print(date)

result:

2021-05-06 11:45:00

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM