Issues with my first Python Web Scraper
I'm writing my first Python web scraper and I'm having trouble getting it to scrape the data I want.
Here is my code so far:
import bs4 as bs
import urllib.request
source = urllib.request.urlopen ('http://finviz.com/screener.ashx?v=340&s=ta_topgainers')
soup = bs.BeautifulSoup(source, "html.parser")
#Ticker = 'quote.ashx?t'
print (Ticker)
What I want to pull from the website is this piece of markup:
<a href="quote.ashx?t=ETRM&ty=c&p=d&b=1">
This is the entire line, but I'm only interested in the part shown above:
<a href="quote.ashx?t=ETRM&ty=c&p=d&b=1"><img src="chart.ashx?t=ETRM&ta=1&ty=c&p=d&s=l" alt="" width="700" height="340" border="0"/></a></td>
Specifically, I want to pull the ticker symbol, which in this case is $ETRM. I would like to pull every ticker symbol on that page that appears in the format above.
I tried isolating the quote.ashx?t part, but it just returns the entire source code of the page.
You can locate the desired link by partially matching the href value with a CSS selector:
link = soup.select_one("a[href*=ETRM]")
print(link["href"])
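To get from the matched href to the ticker symbol itself, one option is to parse the query string with the standard library. This is only a minimal sketch: the helper name ticker_from_href is hypothetical (not part of BeautifulSoup), and the sample href is the one quoted in the question.

```python
from urllib.parse import urlparse, parse_qs

def ticker_from_href(href):
    """Return the value of the 't' query parameter, or None if it is absent."""
    query = urlparse(href).query          # e.g. "t=ETRM&ty=c&p=d&b=1"
    values = parse_qs(query).get("t")     # e.g. ["ETRM"]
    return values[0] if values else None

# Sample href copied from the question
print(ticker_from_href("quote.ashx?t=ETRM&ty=c&p=d&b=1"))
```

In the scraper this would presumably be called as ticker_from_href(link["href"]).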
soup.select('a[href^="quote.ashx?t"]')  # select every <a> tag whose href starts with quote.ashx?t
out:
[<a href="quote.ashx?t=ETRM&ty=c&p=d&b=1"><img alt="" border="0" height="340" src="chart.ashx?t=ETRM&ta=1&ty=c&p=d&s=l" width="700"/></a>,
<a class="tab-link" href="quote.ashx?t=ETRM&ty=c&p=d&b=1">ETRM</a>,
<a href="quote.ashx?t=SSY&ty=c&p=d&b=1"><img alt="" border="0" height="340" src="chart.ashx?t=SSY&ta=1&ty=c&p=d&s=l" width="700"/></a>,
<a class="tab-link" href="quote.ashx?t=SSY&ty=c&p=d&b=1">SSY</a>,
<a href="quote.ashx?t=PTX&ty=c&p=d&b=1"><img alt="" border="0" height="340" src="chart.ashx?t=PTX&ta=1&ty=c&p=d&s=l" width="700"/></a>,
<a class="tab-link" href="quote.ashx?t=PTX&ty=c&p=d&b=1">PTX</a>,
<a href="quote.ashx?t=ZFGN&ty=c&p=d&b=1"><img alt="" border="0" height="340" src="chart.ashx?t=ZFGN&ta=1&ty=c&p=d&s=l" width="700"/></a>,
<a class="tab-link" href="quote.ashx?t=ZFGN&ty=c&p=d&b=1">ZFGN</a>,
<a href="quote.ashx?t=JTPY&ty=c&p=d&b=1"><img alt="" border="0" height="340" src="chart.ashx?t=JTPY&ta=1&ty=c&p=d&s=l" width="700"/></a>,
<a class="tab-link" href="quote.ashx?t=JTPY&ty=c&p=d&b=1">JTPY</a>,
<a href="quote.ashx?t=ARWR&ty=c&p=d&b=1"><img alt="" border="0" height="340" src="chart.ashx?t=ARWR&ta=1&ty=c&p=d&s=l" width="700"/></a>,
<a class="tab-link" href="quote.ashx?t=ARWR&ty=c&p=d&b=1">ARWR</a>,
<a href="quote.ashx?t=PCRX&ty=c&p=d&b=1"><img alt="" border="0" height="340" src="chart.ashx?t=PCRX&ta=1&ty=c&p=d&s=l" width="700"/></a>,
<a class="tab-link" href="quote.ashx?t=PCRX&ty=c&p=d&b=1">PCRX</a>,
<a href="quote.ashx?t=ATOS&ty=c&p=d&b=1"><img alt="" border="0" height="340" src="chart.ashx?t=ATOS&ta=1&ty=c&p=d&s=l" width="700"/></a>,
<a class="tab-link" href="quote.ashx?t=ATOS&ty=c&p=d&b=1">ATOS</a>,
<a href="quote.ashx?t=QTNT&ty=c&p=d&b=1"><img alt="" border="0" height="340" src="chart.ashx?t=QTNT&ta=1&ty=c&p=d&s=l" width="700"/></a>,
<a class="tab-link" href="quote.ashx?t=QTNT&ty=c&p=d&b=1">QTNT</a>,
<a href="quote.ashx?t=GBX&ty=c&p=d&b=1"><img alt="" border="0" height="340" src="chart.ashx?t=GBX&ta=1&ty=c&p=d&s=l" width="700"/></a>,
<a class="tab-link" href="quote.ashx?t=GBX&ty=c&p=d&b=1">GBX</a>]
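Each ticker shows up twice in the output above, once for the chart link and once for the tab-link text link. The following is a rough sketch of how the hrefs could be reduced to a unique, ordered list of tickers. The function name unique_tickers and the regular expression are illustrative assumptions, and the hrefs list is assumed to have been collected already, for example with [a["href"] for a in soup.select('a[href^="quote.ashx?t"]')].

```python
import re

# Captures the value of the "t" query parameter (assumed to hold the ticker)
TICKER_RE = re.compile(r"[?&]t=([^&]+)")

def unique_tickers(hrefs):
    """Extract tickers from hrefs, dropping duplicates but keeping first-seen order."""
    seen = {}
    for href in hrefs:
        match = TICKER_RE.search(href)
        if match:
            seen.setdefault(match.group(1), None)  # dicts preserve insertion order
    return list(seen)

# Sample hrefs taken from the output above
hrefs = [
    "quote.ashx?t=ETRM&ty=c&p=d&b=1",
    "quote.ashx?t=ETRM&ty=c&p=d&b=1",
    "quote.ashx?t=SSY&ty=c&p=d&b=1",
    "quote.ashx?t=SSY&ty=c&p=d&b=1",
]
print(unique_tickers(hrefs))
```

Alternatively, since the tab-link anchors already carry the ticker as their text, selecting only those (for instance with the selector a.tab-link) and reading each tag's text may avoid the parsing step, assuming the page keeps that class name.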