Moving to the next page while scraping

Question

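Moving to the next page while web scraping, and changing the date format

url_list is a list of URLs, one of which is http://www.moneycontrol.com/company-article/cadilahealthcare/news/CHC#CHC. I found that there is an href for moving to a different year and to a different page, but I can't seem to use it. Below is the code that extracts the links from page 1. I want to do the same for every available year and page.

Also, when I extract the date from the HTML, it comes in the format [Last Updated: Feb 07, 2019 03:05 PM IST | Source: Moneycontrol.com]. I want the date in mm/dd/yy format. How would I do that as well?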

import time

import requests
from bs4 import BeautifulSoup

# url_list, Base_url and List_of_links are defined elsewhere in my script
for urls in url_list:
    html = requests.get(urls)
    soup = BeautifulSoup(html.text, 'html.parser')  # Create a BeautifulSoup object

    # Retrieve a list of all the links and the titles for the respective links
    # word1, word2, word3 = "US", "USA", "USFDA"

    sub_links = soup.find_all('a', class_='arial11_summ')
    for links in sub_links:
        sp = BeautifulSoup(str(links), 'html.parser')  # first convert into a string
        tag = sp.a
        # if word1 in tag['title'] or word2 in tag['title'] or word3 in tag['title']:
        category_links = Base_url + tag["href"]
        List_of_links.append(category_links)
        time.sleep(3)

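What I want to do is scrape the first page, then move to the next page, and so on. Once all available pages for a given year have been scraped, the code should move on to the next year. Please explain how I would go about this.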

Answer 1

Moving to the next page:

Add parameters to the URL, like this: https://www.moneycontrol.com/stocks/company_info/stock_news.php?sc_id=CHC&durationType=Y&Year=2018
You can get the list of years from the first page.
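
As a rough sketch of that idea, the helpers below build the per-year URL and walk through years and pages until a page comes back empty. The query keys sc_id, durationType and Year come from the URL above; the page parameter name pageno, the max_pages cap, and the helper names build_url, years_from_hrefs and crawl are assumptions made for illustration, so check the real "next page" href on the site before relying on them. fetch_links is a function you supply, for example one that calls requests.get and returns the href values from soup.find_all('a', class_='arial11_summ').

```python
from urllib.parse import parse_qs, urlencode, urlparse

BASE = "https://www.moneycontrol.com/stocks/company_info/stock_news.php"


def build_url(sc_id, year, page=1):
    """Return the news-listing URL for one company, year and page."""
    params = {"sc_id": sc_id, "durationType": "Y", "Year": year}
    if page > 1:
        # ASSUMPTION: "pageno" is a hypothetical name for the page parameter.
        params["pageno"] = page
    return BASE + "?" + urlencode(params)


def years_from_hrefs(hrefs):
    """Collect the distinct Year=... values found in a list of href strings."""
    years = set()
    for href in hrefs:
        query = parse_qs(urlparse(href).query)
        for value in query.get("Year", []):
            if value.isdigit():
                years.add(int(value))
    return sorted(years, reverse=True)  # newest year first


def crawl(sc_id, years, fetch_links, max_pages=50):
    """Yield links year by year, page by page, until a page comes back empty."""
    for year in years:
        for page in range(1, max_pages + 1):
            links = fetch_links(build_url(sc_id, year, page))
            if not links:
                break  # no more pages for this year, move on to the next year
            yield from links
```

Stopping on the first empty page avoids having to know the page count in advance. Keep a time.sleep between requests inside your fetch_links function so the site is not hit too quickly.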

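Extracting the date: take the substring that contains only the date and time, then parse the time and the time zone like this.

I updated the answer to set the time zone using pytz.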

from datetime import datetime

import pytz

raw = 'Feb 07, 2019 03:05 PM IST'
str_time = raw[:-4]      # 'Feb 07, 2019 03:05 PM'
str_timezone = raw[-3:]  # 'IST'

datetime_object = datetime.strptime(str_time, '%b %d, %Y %I:%M %p')
if str_timezone == 'IST':
    # based on https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
    # assume IST on Moneycontrol means India Standard Time (Asia/Kolkata)
    tz = pytz.timezone('Asia/Kolkata')
else:
    tz = pytz.timezone('UTC')

output = tz.localize(datetime_object)
# test
print(output.strftime('%X %x %z'))
print(output.strftime('%m/%d/%y'))  # mm/dd/yy, as asked in the question
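
If only the mm/dd/yy date is needed and the time zone does not matter, a shorter route is to pull the timestamp out of the bracketed string with a regular expression and reformat it. This is a minimal sketch: the helper name to_mmddyy is made up for illustration, and the exact layout of the "Last Updated" text is an assumption based on the example in the question.

```python
import re
from datetime import datetime

# Matches e.g. "Feb 07, 2019 03:05 PM"
TIMESTAMP = re.compile(r"[A-Z][a-z]{2} \d{1,2}, \d{4} \d{1,2}:\d{2} [AP]M")


def to_mmddyy(raw):
    """Extract the timestamp from a 'Last Updated' string and return mm/dd/yy."""
    match = TIMESTAMP.search(raw)
    if match is None:
        return None  # no recognizable timestamp in this string
    parsed = datetime.strptime(match.group(0), "%b %d, %Y %I:%M %p")
    return parsed.strftime("%m/%d/%y")


print(to_mmddyy("[Last Updated: Feb 07, 2019 03:05 PM IST | Source: Moneycontrol.com]"))
```

Note that %b parses English month abbreviations only when the process runs under an English or C locale.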

Answer 1: 2 votes, accepted, 2019-02-20 08:13:23
