Python: searching through html file grabbing <a> tags with the href and text content

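I need help with a solution that uses Python 3 to search an HTML file and retrieve all the <a> links on the page, then adds the retrieved link text to a dictionary together with the adjacent href (URL).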

This is what I have tried so far.

import urllib3
import re

http = urllib3.PoolManager()
my_url = "https://in.finance.yahoo.com/q/h?s=AAPL"
a = http.request("GET",my_url)
html = a.data

links = re.finditer(' href="?([^\s^"]+)', html)

for link in links:
  print(link)

I get this error...

TypeError: can't use a string pattern on a bytes-like object

Thanks for your help.

I have also tried lxml...

links = lxml.html.parse("http://www.google.co.uk/?gws_rd=ssl#q=apple+stock&tbm=nws").xpath("//a/@href")
for link in links:
    print(link)

The result does not show all the links, and I am not sure why.

Update:

New code =>

    def news_feed(self, stock):
        http = urllib3.PoolManager()
        my_url = "https://in.finance.yahoo.com/q/h?s=" + stock
        a = http.request("GET", my_url)
        html = a.data.decode('utf-8')
        xml = fromstring(html, HTMLParser())
        a_tags = xml.xpath("//table[@id='yfncsumtab']//a")
        self.paired = dict((a.xpath(".//text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags)
        pp(self.paired)

Use an HTML parser and decode the bytes as suggested. BeautifulSoup makes the job very easy and is a lot more reliable than a regex when parsing HTML:

import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
my_url = "https://in.finance.yahoo.com/q/h?s=AAPL"
a = http.request("GET", my_url)
# a.data is bytes; decode it to str before parsing
html = a.data.decode("utf-8")

# every <a> tag that has an href attribute
print([a["href"] for a in BeautifulSoup(html).find_all("a", href=True)])
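
As for the TypeError in the question: a.data is a bytes object, while the regex pattern is a str, and re refuses to mix the two. A minimal sketch of the two ways around it, using a small hypothetical snippet (the raw value is made up for illustration, not fetched from Yahoo):

```python
import re

# Hypothetical stand-in for the bytes that http.request(...).data returns.
raw = b'<a href="https://example.com/a">A</a> <a href="/b">B</a>'

# Option 1: decode the bytes first, then use an ordinary str pattern.
text = raw.decode("utf-8")
str_links = re.findall(r'href="([^"\s]+)"', text)

# Option 2: keep the bytes and use a bytes pattern instead.
bytes_links = re.findall(rb'href="([^"\s]+)"', raw)

# Mixing a str pattern with bytes input reproduces the error from the question.
try:
    re.findall(r'href="([^"\s]+)"', raw)
except TypeError as exc:
    error_message = str(exc)

print(str_links)
print(bytes_links)
print(error_message)
```

Either option gets rid of the error, but a regex will still trip over unquoted or single-quoted attributes, which is why a parser is the safer choice.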

If you only want links that start with http, you can use a CSS selector:

soup = BeautifulSoup(html)

print([a["href"] for a in soup.select("a[href^=http]")])

That would give you:

['https://edit.yahoo.com/config/eval_register?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL', 'https://login.yahoo.com/config/login?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL', 'https://help.yahoo.com/l/in/yahoo/finance/', 'http://in.yahoo.com/bin/set?cmp=uheader&src=others', 'https://in.mail.yahoo.com/?.intl=in&.lang=en-IN', 'http://in.my.yahoo.com', 'https://in.yahoo.com/', 'https://in.finance.yahoo.com', 'https://in.finance.yahoo.com/investing/', 'https://yahoo.uservoice.com/forums/170320-india-finance/category/84926-data-accuracy', 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html', 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html', 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html', 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html', 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html', 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html', 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html', 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html', 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html', 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html', 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html', 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html', 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html', 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html', 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html', 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html', 
'https://in.finance.yahoo.com/news/apple-sees-first-sales-dip-011402926.html', 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-031840725.html', 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html', 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html', 'http://help.yahoo.com/l/in/yahoo/finance/basics/fitadelay2.html', 'http://billing.finance.yahoo.com/realtime_quotes/signup?.src=quote&.refer=quote', 'http://www.capitaliq.com', 'http://www.csidata.com', 'http://www.morningstar.com/']

To get the text and the href:

soup = BeautifulSoup(html)

a_tags = soup.select("a[href^=http]")
from pprint import pprint as pp
paired = dict((a.text, a["href"]) for a in a_tags)

pp(paired)

Output:

 {u'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html',
 u'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html',
 u'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html',
 u'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html',
 u'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html',
 u'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html',
 u'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html',
 u"Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html',
 u"Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html',
 u'Capital IQ': 'http://www.capitaliq.com',
 u'Commodity Systems, Inc. (CSI)': 'http://www.csidata.com',
 u'Download the new Yahoo Mail app': 'https://in.mobile.yahoo.com/mail/',
 u"EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html',
 u'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html',
 u'Help': 'https://help.yahoo.com/l/in/yahoo/finance/',
 u'Mail': 'https://in.mail.yahoo.com/?.intl=in&.lang=en-IN',
 u'Markets': 'https://in.finance.yahoo.com/investing/',
 u'Morningstar, Inc.': 'http://www.morningstar.com/',
 u'My Yahoo': 'http://in.my.yahoo.com',
 u'New User? Register': 'https://edit.yahoo.com/config/eval_register?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL',
 u'Report an Issue': 'https://yahoo.uservoice.com/forums/170320-india-finance/category/84926-data-accuracy',
 u'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html',
 u'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html',
 u'Sign In': 'https://login.yahoo.com/config/login?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL',
 u"Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html',
 u'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html',
 u'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html',
 u'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html',
 u'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html',
 u'Yahoo': 'https://in.yahoo.com/',
 u'Yahoo India Finance': 'https://in.finance.yahoo.com',
 u'other exchanges': 'http://help.yahoo.com/l/in/yahoo/finance/basics/fitadelay2.html',
 u'premium service.': 'http://billing.finance.yahoo.com/realtime_quotes/signup?.src=quote&.refer=quote'}
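
One thing to keep in mind when keying the dictionary by link text: two links with the same text collapse into one entry, and the later href wins. That is presumably why the list of hrefs further up has two URLs whose slugs start with apple-iphone-sales-weaker-expected while the dictionary has only one entry for that headline. A small sketch with made-up pairs (the example.com URLs are hypothetical):

```python
# Hypothetical (text, href) pairs, made up for illustration; two share the same text.
pairs = [
    ("Apple iPhone sales weaker than expected", "https://example.com/story-1"),
    ("Markets", "https://example.com/markets"),
    ("Apple iPhone sales weaker than expected", "https://example.com/story-2"),
]

# Keyed by text: the later href overwrites the earlier one.
by_text = dict(pairs)

# Keyed by href: every distinct URL survives even when the text repeats.
by_href = {href: text for text, href in pairs}

print(len(pairs), len(by_text), len(by_href))
```

If every link matters, keying by href or keeping a plain list of tuples may be the safer structure.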

a[href^=http] means: give me all the a tags that have an href whose value starts with http.
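
To make that filter concrete without any third-party install, here is a rough sketch that applies the same test (href present and starting with http) using only the standard library's html.parser rather than BeautifulSoup. The LinkCollector class and the sample markup are hypothetical, made up for illustration:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for <a> tags whose href starts with a prefix."""

    def __init__(self, prefix="http"):
        super().__init__()
        self.prefix = prefix
        self.pairs = []      # finished (text, href) pairs
        self._href = None    # href of the currently open <a>, if it matched
        self._chunks = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Same test as a[href^=http]: href exists and starts with the prefix.
            if href is not None and href.startswith(self.prefix):
                self._href = href
                self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append(("".join(self._chunks).strip(), self._href))
            self._href = None


# Hypothetical sample markup, made up for illustration.
sample = (
    '<a href="https://example.com/news">News <b>today</b></a>'
    '<a href="/q/h?s=AAPL">Older Headlines</a>'
    '<a name="top">No href</a>'
    '<a href="http://example.org/">Home</a>'
)

collector = LinkCollector()
collector.feed(sample)
collector.close()
paired = dict(collector.pairs)
print(paired)
```

The relative link and the anchor without an href are skipped, just as the selector would skip them.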

Using lxml and the table id to get just the story links, which are probably what you are most interested in:

from lxml.etree import fromstring, HTMLParser

xml = fromstring(html, HTMLParser())

a_tags = xml.xpath("//table[@id='yfncsumtab']//a")

paired = dict((a.xpath(".//text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags)
from pprint import pprint as pp
pp(paired)

Which gives you:

{'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html',
 'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html',
 'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html',
 'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html',
 'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html',
 'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html',
 'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html',
 "Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html',
 "Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html',
 "EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html',
 'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html',
 'Older Headlines': '/q/h?s=AAPL&t=2016-01-27T03:49:08+05:30',
 'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html',
 'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html',
 "Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html',
 'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html',
 'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html',
 'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html',
 'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html'}

We can do the same with select:

soup = BeautifulSoup(html)

a_tags = soup.select("#yfncsumtab a")
from pprint import pprint as pp
paired = dict((a.text, a["href"]) for a in a_tags)
pp(paired)

That will match our lxml output:

{u'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html',
 u'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html',
 u'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html',
 u'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html',
 u'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html',
 u'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html',
 u'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html',
 u"Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html',
 u"Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html',
 u"EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html',
 u'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html',
 u'Older Headlines': '/q/h?s=AAPL&t=2016-01-27T03:49:08+05:30',
 u'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html',
 u'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html',
 u"Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html',
 u'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html',
 u'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html',
 u'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html',
 u'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html'}

You could just use //*[@id='yfncsumtab']//a, since an id should be unique.

To get the first six links from the table with xpath, we can use the ul elements and extract the first six with ul[position() < 7]:

a_tags = xml.xpath("//*[@id='yfncsumtab']//ul[position() < 7]//a")

paired = dict((a.xpath("./text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags)
from pprint import pprint as pp
pp(paired)

That would give you:

{'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html',
 "Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html',
 'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html',
 'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html',
 "Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html',
 'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html'}

For a small table you could also simply slice.
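
A minimal sketch of the slicing idea, using made-up pairs in place of the real a_tags (the links list and the example.com URLs are hypothetical). Note that slicing the links gives the same result as ul[position() < 7] only if each ul holds a single link, which I am assuming here:

```python
from itertools import islice

# Hypothetical (text, href) pairs in document order, made up for illustration.
links = [(f"Story {i}", f"https://example.com/news/{i}") for i in range(1, 11)]

# A list can be sliced directly.
first_six = dict(links[:6])

# islice does the same for any iterable, such as a generator.
first_six_lazy = dict(islice(iter(links), 6))

print(list(first_six))
```

With lxml, xpath() returns a list of elements, so a_tags[:6] should work the same way before building the dictionary.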

