
A better method than readlines?

Using Python 2.5, I am reading an HTML file for three different pieces of information. The way I find each piece is to match a regex* and then count a specific number of lines down from the matching line to get the actual value I'm looking for. The problem is that I have to re-open the site three times (once for each piece of information I'm looking up). I think this is inefficient and want to be able to look up all three things while opening the site only once. Does anyone have a better method or suggestion?

* I will learn a better way, such as BeautifulSoup, but for now I need a quick fix.

Code:

def scrubdividata(ticker):
    try:
        f = urllib2.urlopen('http://dividata.com/stock/%s' % (ticker))
        lines = f.readlines()
        for i in range(0, len(lines)):
            line = lines[i]
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        f = urllib2.urlopen('http://dividata.com/stock/%s' % (ticker))
        lines = f.readlines()
        for i in range(0, len(lines)):
            line = lines[i]
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        f = urllib2.urlopen('http://dividata.com/stock/%s' % (ticker))
        lines = f.readlines()
        for i in range(0, len(lines)):
            line = lines[i]
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                end = '</td>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker, LastDiv, AnnualDiv, LastExDivDate))
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)

Thanks,

B

I found a solution that works! I deleted the two extraneous urlopen and readlines calls, leaving only one of each before the loops (previously I had deleted only the urlopen calls but left the readlines calls in place). Here is my corrected code:

def scrubdividata(ticker):
    try:
        f = urllib2.urlopen('http://dividata.com/stock/%s' % (ticker))
        lines = f.readlines()
        for i in range(0, len(lines)):
            line = lines[i]
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        #f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        #lines = f.readlines()
        for i in range(0, len(lines)):
            line = lines[i]
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        #f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        #lines = f.readlines()
        for i in range(0, len(lines)):
            line = lines[i]
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                end = '</td>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker, LastDiv, AnnualDiv, LastExDivDate))
        print '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'
        print ticker, LastDiv, AnnualDiv, LastExDivDate
        print '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)
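Why deleting only the urlopen calls was not enough: a file-like object keeps a read position, so once readlines() has consumed the response, a second readlines() on the same object returns an empty list. A minimal sketch, using io.StringIO to stand in for the urllib2 response object:

```python
from io import StringIO  # stands in for the urllib2 response object

f = StringIO("line one\nline two\n")
first = f.readlines()   # consumes the whole stream
second = f.readlines()  # the stream is already at EOF
print(first)   # ['line one\n', 'line two\n']
print(second)  # []
```

This is why the later loops found no matches until the leftover readlines calls were removed and the original lines list was reused.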

BeautifulSoup example for reference (written from memory for Python 2; I only have it installed for Python 3 here, so some of the syntax may be off a bit):

from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen

yoursite = "http://...."
f = urlopen(yoursite)  # urllib2 responses do not support "with" on Python 2
soup = BeautifulSoup(f)

for node in soup.findAll('td', attrs={'class': 'descrip'}):
    print node.text
    print node.nextSibling.nextSibling.text  # next_sibling is the BeautifulSoup 4 spelling

Outputs (for sample input 'GOOG'):

Last Close:
$910.68
Annual Dividend:
N/A
Pay Date:
N/A
Dividend Yield:
N/A
Ex-Dividend Date:
N/A
Years Paying:
N/A
52 Week Dividend:
$0.00
etc.

BeautifulSoup can be easy to use on sites that have a predictable schema.
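In the same spirit, if BeautifulSoup is not an option on Python 2.5, the label/value pairs can be collected into a dictionary with a single regex pass over the page. A minimal sketch; the HTML snippet below is illustrative, mimicking the site's label/value table rows, not copied from the live page:

```python
import re

# Illustrative snippet in the style of dividata.com's label/value cells.
html = """
<td class="descrip">Annual Dividend:</td><td>$2.08</td>
<td class="descrip">Last Dividend:</td><td>$0.52</td>
<td class="descrip">Last Ex-Dividend Date:</td><td>2013-08-07</td>
"""

# One pass: capture each label cell and the value cell that follows it.
pairs = re.findall(r'<td class="descrip">([^<]+)</td><td>([^<]+)</td>', html)
fields = dict(pairs)
print(fields["Annual Dividend:"])        # $2.08
print(fields["Last Ex-Dividend Date:"])  # 2013-08-07
```

Building a dictionary this way means the page is fetched and scanned once, and every field is then a simple lookup.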

def scrubdividata(ticker):
    try:
        end = '</td>'
        f = urllib2.urlopen('http://dividata.com/stock/%s' % (ticker))
        lines = f.readlines()
        for i in range(0, len(lines)):
            line = lines[i]
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker, LastDiv, AnnualDiv, LastExDivDate))
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)

Note that lines will contain the lines that you need, so there is no need to call f.readlines() again. Simply reuse lines.

Small note: you can iterate over the lines directly; since the code also needs the index for lines[i+1], use enumerate:

def scrubdividata(ticker):
    try:
        f = urllib2.urlopen('http://dividata.com/stock/%s' % (ticker))
        lines = f.readlines()
        for i, line in enumerate(lines):
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)

        for i, line in enumerate(lines):
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)

        for i, line in enumerate(lines):
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                end = '</td>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker, LastDiv, AnnualDiv, LastExDivDate))
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)
