Python web scraping with BeautifulSoup
I want to scrape stock tickers from this HTML site with BeautifulSoup. View source: https://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&pagesize=40
I want to pull the tickers out of the h3 elements, e.g. "PIH":
<td>
<h3>
<a href="/symbol/pih">
PIH</a>
</h3>
</td>
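For a self-contained look at the extraction itself, the snippet above can be parsed directly (using the stdlib html.parser backend so no lxml install is assumed):

```python
from bs4 import BeautifulSoup

# The <td> fragment from the question, inlined for a reproducible demo
html = """
<td>
<h3>
<a href="/symbol/pih">
PIH</a>
</h3>
</td>
"""

soup = BeautifulSoup(html, 'html.parser')

# get_text(strip=True) drops the surrounding newlines and spaces
ticker = soup.find('h3').get_text(strip=True)
print(ticker)  # PIH
```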
So far I have tried:
import requests
import bs4 as bs

resp = requests.get('https://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&pagesize=40')
soup = bs.BeautifulSoup(resp.text, 'lxml')
table = soup.find('div', {'class': 'genTable thin'})

tickers = []
for row in table.findAll('tr'):
    ticker = row.findAll('h3')
    tickers.append(ticker)
The result I get is:
[[], [<h3>
<a href="/symbol/yi">
YI</a>
</h3>], [], [<h3>
<a href="/symbol/pih">
PIH</a>
</h3>], [], [<h3>
<a href="/symbol/pihpp">
PIHPP</a>
</h3>], [], [<h3>
<a href="/symbol/turn">
TURN</a>
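The nesting and the empty lists come from `findAll('h3')` returning a list per table row, empty for rows without an `<h3>`. A minimal sketch on a hypothetical fragment mirroring the page structure shows the effect, and how flattening yields plain strings:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mirroring the page: one row has no <h3> at all
html = """
<table>
<tr><td><h3><a href="/symbol/yi">YI</a></h3></td></tr>
<tr><td>header row, no ticker</td></tr>
<tr><td><h3><a href="/symbol/pih">PIH</a></h3></td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all('h3') returns one list per row -- empty for rows without an <h3> --
# which is why appending it directly produces [[], [<h3>...], ...]
nested = [row.find_all('h3') for row in soup.find_all('tr')]

# Flattening and taking the text yields plain strings instead
tickers = [h3.get_text(strip=True) for row in nested for h3 in row]
print(tickers)  # ['YI', 'PIH']
```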
You can use the following code:
import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&pagesize=40')
soup = BeautifulSoup(resp.text, 'lxml')
table = soup.find('div', {'class': 'genTable thin'})

tickers = []
for row in table.findAll('tr'):
    for ticker in row.findAll('h3'):
        # extract the text from the element and strip extra spaces and escape sequences
        tickers.append(ticker.text.strip())

print(tickers)
Output:
['YI', 'PIH', 'PIHPP', 'TURN', 'FLWS', 'BCOW', 'FCCY', 'SRCE', 'VNET', 'TWOU', 'QFIN', 'JOBS', 'JFK', 'JFKKR', 'JFKKU', 'JFKKW', 'EGHT', 'JFU', 'AAON', 'ABEO', 'ABEOW', 'ABIL', 'ABMD', 'AXAS', 'ACIU', 'ACIA', 'ACTG', 'ACHC', 'ACAD', 'ACAM', 'ACAMU', 'ACAMW', 'ACST', 'AXDX', 'XLRN', 'ARAY', 'ACRX', 'ACER', 'ACHV', 'ACHN']
You can extract the text like this:
import requests
import bs4 as bs

resp = requests.get('https://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&pagesize=40')
soup = bs.BeautifulSoup(resp.text, 'lxml')
table = soup.find('div', {'class': 'genTable thin'})

for row in table.findAll('tr'):
    ticker = row.findAll('h3')
    if ticker:  # skip rows that contain no <h3>
        print(ticker[0].text.strip())
Another solution, using a CSS selector and a list comprehension:
import requests
from pprint import pprint
from bs4 import BeautifulSoup

resp = requests.get('https://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&pagesize=40')
soup = BeautifulSoup(resp.text, 'lxml')

tickers = [h3.get_text(strip=True) for h3 in soup.select('#CompanylistResults h3')]
pprint(tickers)
Prints:
['YI',
'PIH',
'PIHPP',
'TURN',
'FLWS',
'BCOW',
'FCCY',
'SRCE',
'VNET',
...and so on.
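The `#CompanylistResults` id in the selector refers to the results container on the live page. The same `select` call can be exercised offline on a small inline fragment (hypothetical markup mirroring that structure):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the page's results container
html = """
<div id="CompanylistResults">
<table>
<tr><td><h3><a href="/symbol/yi">YI</a></h3></td></tr>
<tr><td><h3><a href="/symbol/pih">PIH</a></h3></td></tr>
</table>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# '#CompanylistResults h3' matches every <h3> inside the element with that id
tickers = [h3.get_text(strip=True) for h3 in soup.select('#CompanylistResults h3')]
print(tickers)  # ['YI', 'PIH']
```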
You can use re to find the ticker links and extract the text:
import re
import requests
from bs4 import BeautifulSoup as soup

d = soup(requests.get('https://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&pagesize=40').text, 'html.parser')
# Take the last path segment of every /symbol/... link; the set deduplicates
r = [*{i['href'].split('/')[-1].upper() for i in d.find_all('a', {'href': re.compile(r'/symbol/\w+$')})}]
Output:
['PIH', 'JFK', 'FCCY', 'AXAS', 'FLWS', 'ACHC', 'ACIA', 'AXDX', 'JFKKW', 'ACAM', 'ACST', 'XLRN', 'ACTG', 'JFU', 'ABEOW', 'ABIL', 'ACER', 'ACIU', 'YI', 'ABEO', 'ACAMW', 'ACRX', 'ACHN', 'JOBS', 'ARAY', 'PIHPP', 'TWOU', 'ACAMU', 'ACHV', 'AAON', 'QFIN', 'SRCE', 'VNET', 'BCOW', 'JFKKR', 'ACAD', 'TURN', 'EGHT', 'JFKKU', 'ABMD']
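Because the comprehension builds a set before unpacking it into a list, duplicate links collapse and the output order is arbitrary, as seen above. A minimal sketch on a hypothetical fragment shows the href matching and deduplication:

```python
import re
from bs4 import BeautifulSoup as soup

# Two links to the same symbol plus one non-symbol link (illustrative markup)
html = ('<a href="/symbol/pih">PIH</a>'
        '<a href="/symbol/pih">PIH</a>'
        '<a href="/symbol/yi">YI</a>'
        '<a href="/other">not a ticker</a>')

d = soup(html, 'html.parser')

# The regex keeps only hrefs ending in /symbol/<word>; the set drops duplicates
r = [*{i['href'].split('/')[-1].upper()
       for i in d.find_all('a', {'href': re.compile(r'/symbol/\w+$')})}]

print(sorted(r))  # ['PIH', 'YI'] -- sorted here because set order is arbitrary
```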
To keep only tickers made up entirely of capital letters (filtering out symbols containing digits or punctuation), you can use re:
results = [i for i in r if re.findall('^[A-Z]+$', i)]
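For example, a symbol like `JFKKU` passes the `^[A-Z]+$` check while anything with digits or punctuation is dropped (the sample inputs here are illustrative):

```python
import re

# Illustrative mix of ticker-like strings
r = ['PIH', 'JFKKU', 'AB-C', 'BRK.A', 'X1']

# re.findall returns a non-empty (truthy) list only for all-capital-letter strings
results = [i for i in r if re.findall('^[A-Z]+$', i)]
print(results)  # ['PIH', 'JFKKU']
```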