
Scraping Date of News

I am trying to scrape the date of a news article from https://finansial.bisnis.com/read/20210506/90/1391096/laba-bank-mega-tumbuh-dua-digit-kuartal-i-2021-ini-penopangnya . Here is my code:

news['tanggal'] = newsScrape['date']
dates = []
for x in news['tanggal']:
    x = listToString(x)
    x = x.strip()
    x = x.replace('\r', '').replace('\n', '').replace(' \xa0|\xa0', ',').replace('|', ', ')
    dates.append(x)
dates = listToString(dates)
dates = dates[0:20]
if len(dates) == 0:
    continue
news['tanggal'] = dt.datetime.strptime(dates, '%d %B %Y, %H:%M')

But I got this error:

ValueError: time data '06 Mei 2021, 11:32  ' does not match format '%d %B %Y, %H:%M'

My assumption is that this happens because Mei is Indonesian, while the format expects the English May. How can I change Mei to May? I have tried dates = dates.replace('Mei', 'May'), but it doesn't work for me. When I tried it, I got the error ValueError: unconverted data remains: . The type of dates is string. Thanks.

Your assumption regarding the Mei -> May change is correct. The reason you're likely still facing a problem after the replacement is the trailing spaces in your string, which are not accounted for in your format. You can use str.rstrip() to remove these spaces.

import datetime as dt

dates = "06 Mei 2021, 11:32  "
dates = dates.replace("Mei", "May") # The replacement will have to be handled for all months, this is only an example
dates = dates.rstrip()
date = dt.datetime.strptime(dates, "%d %B %Y, %H:%M")
print(date) # 2021-05-06 11:32:00

While this does fix the problem here, it's messy to have to trim the string like this after dates = dates[0:20]. Consider using a regex to extract the date in the appropriate format in one step.
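The regex suggestion could look roughly like the sketch below. It is only an illustration under assumptions: the names DATE_RE and extract_datetime are hypothetical, the pattern assumes the text contains "day month year", then some separator, then "hour:minute", and only Mei is translated, as in the example above. It also relies on %B matching English month names, which holds in the default C locale.

```python
import datetime as dt
import re

# Hypothetical pattern: day, month name, year, any non-digit separator, HH:MM.
# \s also matches the non-breaking space (\xa0) found in the scraped text.
DATE_RE = re.compile(r"(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})\D+?(\d{1,2}):(\d{2})")


def extract_datetime(raw):
    """Pull the first date-like span out of raw text and parse it."""
    match = DATE_RE.search(raw)
    if match is None:
        raise ValueError(f"no date found in {raw!r}")
    day, month, year, hour, minute = match.groups()
    # Only May is handled here; other months need the same treatment.
    month = month.replace("Mei", "May")
    return dt.datetime.strptime(
        f"{day} {month} {year} {hour}:{minute}", "%d %B %Y %H:%M"
    )


print(extract_datetime("06 Mei 2021, 11:32  "))
```

Because the pattern picks out only the pieces it needs, no slicing with [0:20] or explicit stripping of trailing whitespace is required.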

The problem seems to be just the trailing whitespace, which explains the error ValueError: unconverted data remains: . It is complaining that it is unable to convert the remaining data (the whitespace).

from datetime import datetime

s = '06 Mei 2021, 11:32  '.replace('Mei', 'May').strip()
datetime.strptime(s, '%d %B %Y, %H:%M')
# Returns datetime.datetime(2021, 5, 6, 11, 32)

Also, to convert all the Indonesian month names to English, you can use a dictionary:

id_en_dict = {
    ...,
    'Mei': 'May',
    ...
}
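As a fuller illustration of the dictionary idea, here is a minimal sketch. The names ID_TO_EN_MONTHS and parse_indonesian_date are hypothetical, and the twelve entries are the standard Indonesian month names rather than something taken from the answer itself. It assumes the date string already has the "06 Mei 2021, 11:32" shape and that %B matches English month names (the default C locale).

```python
import datetime as dt

# Indonesian month name -> English month name (filled in for illustration).
ID_TO_EN_MONTHS = {
    "Januari": "January",
    "Februari": "February",
    "Maret": "March",
    "April": "April",
    "Mei": "May",
    "Juni": "June",
    "Juli": "July",
    "Agustus": "August",
    "September": "September",
    "Oktober": "October",
    "November": "November",
    "Desember": "December",
}


def parse_indonesian_date(text, fmt="%d %B %Y, %H:%M"):
    """Translate the month name, drop surrounding whitespace, then parse."""
    text = text.strip()
    for id_name, en_name in ID_TO_EN_MONTHS.items():
        text = text.replace(id_name, en_name)
    return dt.datetime.strptime(text, fmt)


print(parse_indonesian_date("06 Mei 2021, 11:32  "))
```

Translating the month name first keeps strptime independent of whether an Indonesian locale is installed on the machine.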

You can try the following:

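import datetime as dt
import requests
from bs4 import BeautifulSoup

url = "https://finansial.bisnis.com/read/20210506/90/1391096/laba-bank-mega-tumbuh-dua-digit-kuartal-i-2021-ini-penopangnya"
# Note: verify=False disables TLS certificate verification
r = requests.get(url, verify=False)
soup = BeautifulSoup(r.content, 'html.parser')

# The date sits in the first <span> inside the element with class "new-description"
info_soup = soup.find(class_="new-description")
x = info_soup.find('span').get_text(strip=True)

# Normalize line breaks and separators, then keep only the date part
x = x.strip()
x = x.replace('\r', '').replace('\n', '').replace(' \xa0|\xa0', ',').replace('|', ', ')
x = x[0:20]
x = x.rstrip()  # drop the trailing spaces left over after slicing

# Only May is translated here; other months need the same replacement
date = dt.datetime.strptime(x.replace('Mei', 'May'), '%d %B %Y, %H:%M')
print(date)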

Result:

2021-05-06 11:45:00
