Combining regex with BeautifulSoup to parse a string in Python

Question

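I have a series of strings like "Saturday, December 27th 2014". I want to drop the "Saturday" and save a file named "141227", that is, year + month + day. Everything works so far, except that I cannot get the regex for daypos or yearpos to work. Both give the same error: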

Traceback (most recent call last):
  File "scrapewaybackblog.py", line 17, in <module>
    daypos = byline.find(re.compile("[A-Z][a-z]*\s"))
TypeError: expected a character buffer object

What is a character buffer object? Does that mean something is wrong with my expression? Here is my script:

for i in xrange(3, 1, -1):
       page = urllib2.urlopen("http://web.archive.org/web/20090204221349/http://www.americansforprosperity.org/nationalblog?page={}".format(i))
       soup = BeautifulSoup(page.read())
       snippet = soup.find_all('div', attrs={'class': 'blog-box'})
       for div in snippet:
           byline =  div.find('div', attrs={'class': 'date'}).text.encode('utf-8')
           text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8')

           monthpos = byline.find(",")
           daypos = byline.find(re.compile("[A-Z][a-z]*\s"))
           yearpos = byline.find(re.compile("[A-Z][a-z]*\D\d*\w*\s"))
           endpos = monthpos + len(byline)

           month = byline[monthpos+1:daypos]
           day = byline[daypos+0:yearpos]
           year = byline[yearpos+2:endpos]

           output_files_pathname = 'Data/'  # path where output will go
           new_filename = year + month + day + ".txt"
           outfile = open(output_files_pathname + new_filename,'w')
           outfile.write(date)
           outfile.write("\n")
           outfile.write(text)
           outfile.close()
       print "finished another url from page {}".format(i)

I haven't figured out how to make December = 12 yet, but that is for another time. Please help me find the proper positions.

Answer 1

Instead of parsing the date string with regex, parse it with dateutil:

from dateutil.parser import parse

for div in soup.select('div.blog-box'):
    byline = div.find('div', attrs={'class': 'date'}).text.encode('utf-8')
    text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8')

    dt = parse(byline)
    # %y%m%d is zero-padded, so December 27th 2014 becomes "141227"
    new_filename = dt.strftime("%y%m%d") + ".txt"
    ...

Alternatively, you can parse the string with datetime.strptime(), but you need to take care of the ordinal suffixes first:

import re
from datetime import datetime

byline = re.sub(r"(?<=\d)(st|nd|rd|th)", "", byline)
dt = datetime.strptime(byline, '%A, %B %d %Y')

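Here re.sub() finds an st, nd, rd or th string that directly follows a digit and replaces it with an empty string. After that, the date string matches the '%A, %B %d %Y' format; see: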

strftime() and strptime() Behavior
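To make the strptime() route concrete, here is a minimal self-contained sketch using only the standard library. The helper name byline_to_stamp is made up for illustration, and it assumes every byline looks like the sample string from the question and that English month and weekday names are in effect (the default C locale):

```python
import re
from datetime import datetime


def byline_to_stamp(byline):
    """Turn 'Saturday, December 27th 2014' into '141227' (hypothetical helper)."""
    # Drop an ordinal suffix only when it directly follows a digit,
    # so the "st" inside "August" is left alone.
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)", "", byline.strip())
    dt = datetime.strptime(cleaned, "%A, %B %d %Y")
    # %y, %m and %d are all zero-padded two-digit fields.
    return dt.strftime("%y%m%d")


stamp = byline_to_stamp("Saturday, December 27th 2014")
print(stamp)  # 141227
```

This also covers the "December = 12" part of the question: once the string is parsed, dt.month is the integer 12, so no manual month lookup is needed.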

Some additional notes:

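You can pass the result of urlopen() directly to the BeautifulSoup constructor
Instead of calling find_all() with a class name, use the CSS selector div.blog-box
To join file system paths, use os.path.join()
Use the with context manager when working with files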
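The last two notes can be tried in isolation. The sketch below is only an illustration: save_post and the temporary directory are hypothetical names standing in for the script's 'Data' folder, and the sample strings are placeholders:

```python
import os
import tempfile


def save_post(directory, filename, byline, text):
    """Write the byline and the text to directory/filename (hypothetical helper)."""
    # os.path.join() inserts the right separator for the current platform.
    path = os.path.join(directory, filename)
    # The with block closes the file even if one of the writes raises.
    with open(path, "w") as outfile:
        outfile.write(byline)
        outfile.write("\n")
        outfile.write(text)
    return path


out_dir = tempfile.mkdtemp()  # stand-in for the 'Data' directory
saved = save_post(out_dir, "141227.txt", "Saturday, December 27th 2014", "post body")
print(saved)
```

Compared with building the path by string concatenation and calling close() by hand, this version cannot forget a separator and does not leak the file handle on errors.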

Fixed version:

import os
import urllib2

from bs4 import BeautifulSoup
from dateutil.parser import parse


for i in xrange(3, 1, -1):
    page = urllib2.urlopen("http://web.archive.org/web/20090204221349/http://www.americansforprosperity.org/nationalblog?page={}".format(i))
    soup = BeautifulSoup(page)

    for div in soup.select('div.blog-box'):
        byline = div.find('div', attrs={'class': 'date'}).text.encode('utf-8')
        text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8')

        dt = parse(byline)

        # %y%m%d is zero-padded, so December 27th 2014 becomes "141227"
        new_filename = dt.strftime("%y%m%d") + ".txt"
        with open(os.path.join('Data', new_filename), 'w') as outfile:
            outfile.write(byline)
            outfile.write("\n")
            outfile.write(text)

    print "finished another url from page {}".format(i)
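As for the error itself: str.find() only accepts a plain substring, and in Python 2 a "character buffer object" is essentially a string-like object, which a compiled pattern is not. If positions from a regex are still wanted, the pattern has to do the searching. A small sketch, assuming the sample byline from the question (the pattern and variable names are illustrative, not taken from the original script):

```python
import re

byline = "Saturday, December 27th 2014"

# str.find() takes a substring and returns its index (or -1 if absent).
comma_pos = byline.find(",")

# A compiled pattern is searched with its own search() method, which
# returns a match object carrying start() and end() positions, or None.
weekday_re = re.compile(r"[A-Z][a-z]*")
match = weekday_re.search(byline)
weekday_span = (match.start(), match.end()) if match else None

# Named groups pull the pieces out directly, without index arithmetic.
date_re = re.compile(
    r"(?P<weekday>\w+), (?P<month>\w+) (?P<day>\d+)\w* (?P<year>\d{4})"
)
m = date_re.match(byline)
fields = m.groupdict() if m else {}

print(comma_pos)
print(weekday_span)
print(fields.get("month"), fields.get("day"), fields.get("year"))
```

Even with the positions fixed, the month would still be a name rather than a number, which is why handing the whole string to a date parser is the simpler route here.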

Solution 1
5 Accepted 2014-12-28 00:23:27