
A better method than readlines?

Using Python 2.5, I am reading an HTML file for three different pieces of information. The way I find each piece is to match a regex* and then count a specific number of lines down from the matching line to get the actual value I'm looking for. The problem is that I have to re-open the site three times (once for each piece of information I'm looking up). I think this is inefficient and want to be able to look up all three things while opening the site only once. Does anyone have a better method or suggestion?

* I will learn a better way, such as BeautifulSoup, but for now I need a quick fix.

Code:

def scrubdividata(ticker):
    try:
        f = urllib2.urlopen('http://dividata.com/stock/%s' % (ticker))
        lines = f.readlines()
        for i in range(0, len(lines)):
            line = lines[i]
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        f = urllib2.urlopen('http://dividata.com/stock/%s' % (ticker))
        lines = f.readlines()
        for i in range(0, len(lines)):
            line = lines[i]
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        f = urllib2.urlopen('http://dividata.com/stock/%s' % (ticker))
        lines = f.readlines()
        for i in range(0, len(lines)):
            line = lines[i]
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                end = '</td>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker, LastDiv, AnnualDiv, LastExDivDate))
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)

Thanks,

B

I found a solution that works! I deleted the two extraneous urlopen and readlines calls, leaving only one of each before the loops (previously I had deleted only the urlopen calls but left the readlines calls in place). Here is my corrected code:

def scrubdividata(ticker):
    try:
        f = urllib2.urlopen('http://dividata.com/stock/%s' % (ticker))
        lines = f.readlines()
        for i in range(0, len(lines)):
            line = lines[i]
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        #f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        #lines = f.readlines()
        for i in range(0, len(lines)):
            line = lines[i]
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        #f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        #lines = f.readlines()
        for i in range(0, len(lines)):
            line = lines[i]
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                end = '</td>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker, LastDiv, AnnualDiv, LastExDivDate))
        print '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'
        print ticker, LastDiv, AnnualDiv, LastExDivDate
        print '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)
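Why deleting only the urlopen calls was not enough: a file-like object keeps a read position, so once readlines() has consumed the response, a second readlines() on the same object returns an empty list. A minimal sketch, using io.StringIO to stand in for the urllib2 response object:

```python
from io import StringIO  # stands in for the urllib2 response object

f = StringIO("line one\nline two\n")
first = f.readlines()   # consumes the whole stream
second = f.readlines()  # the stream is already at EOF
print(first)   # ['line one\n', 'line two\n']
print(second)  # []
```

This is why the later loops found no matches until the leftover readlines calls were removed and the original lines list was reused.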

BeautifulSoup example for reference (written from memory for Python 2; I only have it installed for Python 3 here, so some of the syntax may be off a bit):

from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen

yoursite = "http://...."
f = urlopen(yoursite)  # urllib2 responses do not support "with" on Python 2
soup = BeautifulSoup(f)

for node in soup.findAll('td', attrs={'class': 'descrip'}):
    print node.text
    print node.nextSibling.nextSibling.text  # next_sibling is the BeautifulSoup 4 spelling

Outputs (for sample input 'GOOG'):

Last Close:
$910.68
Annual Dividend:
N/A
Pay Date:
N/A
Dividend Yield:
N/A
Ex-Dividend Date:
N/A
Years Paying:
N/A
52 Week Dividend:
$0.00
etc.

BeautifulSoup can be easy to use on sites that have a predictable schema.
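In the same spirit, if BeautifulSoup is not an option on Python 2.5, the label/value pairs can be collected into a dictionary with a single regex pass over the page. A minimal sketch; the HTML snippet below is illustrative, mimicking the site's label/value table rows, not copied from the live page:

```python
import re

# Illustrative snippet in the style of dividata.com's label/value cells.
html = """
<td class="descrip">Annual Dividend:</td><td>$2.08</td>
<td class="descrip">Last Dividend:</td><td>$0.52</td>
<td class="descrip">Last Ex-Dividend Date:</td><td>2013-08-07</td>
"""

# One pass: capture each label cell and the value cell that follows it.
pairs = re.findall(r'<td class="descrip">([^<]+)</td><td>([^<]+)</td>', html)
fields = dict(pairs)
print(fields["Annual Dividend:"])        # $2.08
print(fields["Last Ex-Dividend Date:"])  # 2013-08-07
```

Building a dictionary this way means the page is fetched and scanned once, and every field is then a simple lookup.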

def scrubdividata(ticker):
    try:
        end = '</td>'
        f = urllib2.urlopen('http://dividata.com/stock/%s' % (ticker))
        lines = f.readlines()
        for i in range(0, len(lines)):
            line = lines[i]
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker, LastDiv, AnnualDiv, LastExDivDate))
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)

Note that lines will contain the lines that you need, so there is no need to call f.readlines() again. Simply reuse lines.

Small note: you can iterate over the lines directly; since the code also needs the index for lines[i+1], use enumerate:

def scrubdividata(ticker):
    try:
        f = urllib2.urlopen('http://dividata.com/stock/%s' % (ticker))
        lines = f.readlines()
        for i, line in enumerate(lines):
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)

        for i, line in enumerate(lines):
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)

        for i, line in enumerate(lines):
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                end = '</td>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker, LastDiv, AnnualDiv, LastExDivDate))
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)
