A better method than readlines?
Using Python 2.5, I am reading an HTML file for three different pieces of information. The way I find each piece is by matching a regex* and then counting a specific number of lines down from the matching line to get the actual information I'm looking for. The problem is that I have to re-open the site three times (once for each piece of info I'm looking up). I think this is inefficient and want to be able to look up all three things while opening the site only once. Does anyone have a better method or suggestion?
* I will learn a better way, such as BeautifulSoup, but for now I need a quick fix.
Code:
def scrubdividata(ticker):
    try:
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                end = '</td>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker,LastDiv,AnnualDiv,LastExDivDate))
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)
        else:
            pass
        pass
Thanks,
B
I found a solution that works! I deleted the two extraneous urlopen and readlines calls, leaving only one fetch for all the loops (previously I had deleted only the urlopen calls but left the readlines). Here is my corrected code:
def scrubdividata(ticker):
    try:
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        #f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        #lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        #f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        #lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                end = '</td>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker,LastDiv,AnnualDiv,LastExDivDate))
        print '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'
        print ticker,LastDiv,AnnualDiv,LastExDivDate
        print '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)
        else:
            pass
        pass
BeautifulSoup example for reference (written from memory for Python 2; I only have it for Python 3 here, so some of the syntax may be off a bit):
from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen

yoursite = "http://...."
with urlopen(yoursite) as f:
    soup = BeautifulSoup(f)

for node in soup.findAll('td', attrs={'class':'descrip'}):
    print node.text
    print node.next_sibling.next_sibling.text
Outputs (for sample input 'GOOG'):
Last Close:
$910.68
Annual Dividend:
N/A
Pay Date:
N/A
Dividend Yield:
N/A
Ex-Dividend Date:
N/A
Years Paying:
N/A
52 Week Dividend:
$0.00
etc.
BeautifulSoup can be easy to use on sites that have a predictable schema.
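If installing BeautifulSoup is not an option, the same label/value walk can be sketched with only the standard library. The snippet below is a rough Python 3 illustration against a made-up HTML fragment (the cell contents and layout are assumptions modelled on the output above, not taken from the live page):

```python
from html.parser import HTMLParser

# Made-up fragment resembling the page's label/value table rows.
SAMPLE = """
<table>
  <tr><td class="descrip">Annual Dividend:</td><td>$2.44</td></tr>
  <tr><td class="descrip">Last Dividend:</td><td>$0.61</td></tr>
  <tr><td class="descrip">Last Ex-Dividend Date:</td><td>2013-08-07</td></tr>
</table>
"""

class TdCollector(HTMLParser):
    """Collects the text of every <td> cell, in document order."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data

parser = TdCollector()
parser.feed(SAMPLE)

# Cells alternate label, value, label, value, ... so zip them into a dict.
info = dict(zip(parser.cells[0::2], parser.cells[1::2]))
print(info)
```

The same pairing trick (label cell followed by value cell) is what the `next_sibling` calls in the BeautifulSoup snippet rely on.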
def scrubdividata(ticker):
    try:
        end = '</td>'
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker,LastDiv,AnnualDiv,LastExDivDate))
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)
        else:
            pass
        pass
Note that lines will contain the lines that you need, so there is no need to call f.readlines() again. Simply reuse lines.
Small note: instead of indexing with range(0, len(lines)), you can iterate with enumerate(), which yields both the line and its index (the index is still needed to reach the line after the match):
def scrubdividata(ticker):
    try:
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i, line in enumerate(lines):
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        for i, line in enumerate(lines):
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        for i, line in enumerate(lines):
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                end = '</td>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker,LastDiv,AnnualDiv,LastExDivDate))
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)
        else:
            pass
        pass
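The three passes can also be collapsed into a single loop by mapping each label to the regex for its value line. This is only a sketch against a made-up list of lines (mimicking the page's label-then-value layout, not real page data), written for Python 3 for convenience:

```python
import re

# Made-up lines resembling the page: each label <td> is followed on the
# next line by its value <td>.
SAMPLE_LINES = [
    '<td class="descrip">Annual Dividend:</td>\n',
    '<td>$2.44</td>\n',
    '<td class="descrip">Last Dividend:</td>\n',
    '<td>$0.61</td>\n',
    '<td class="descrip">Last Ex-Dividend Date:</td>\n',
    '<td>2013-08-07</td>\n',
]

# Label -> regex for the value on the following line (dollar amounts are
# captured without the '$', matching the original start='>\$' pattern).
patterns = {
    'Annual Dividend:': r'>\$(.*)</td>',
    'Last Dividend:': r'>\$(.*)</td>',
    'Last Ex-Dividend Date:': r'>(.*)</td>',
}

results = {}
for i, line in enumerate(SAMPLE_LINES):
    for label, pattern in patterns.items():
        if label in line:
            m = re.search(pattern, SAMPLE_LINES[i + 1])
            if m:
                results[label] = m.group(1)

print(results)
```

This keeps one fetch, one pass over the lines, and makes adding a fourth field a one-line change to the dict.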