
How can I get information from an <a href> tag within <div> tags with BeautifulSoup and Python?

Hi all. I have a quick question about BeautifulSoup with Python. I have several bits of HTML that look like this (the only differences are the links and product names), and I'm trying to get the link from the "href" attribute.

<div id="productListing1" xmlns:dew="urn:Microsoft.Search.Response.Document">
<span id="rank" style="display:none;">94.36</span>
<div class="productPhoto">
    <img src="/assets/images/ocpimages/87684/00131cl.gif" height="82" width="82" />
</div>
<div class="productName">
    <a class="on" href="/Products/ProductInfoDisplay.aspx?SiteId=1&amp;Product=8768400131">CAPRI SUN - JUICE DRINK - COOLERS VARIETY PACK 6 OZ</a>
</div>
<div class="size">40 CT</div>

I currently have this Python code:

productLinks = soup.findAll('a', attrs={'class' : 'on'})
for link in productLinks:
    print(link['href'])

This works (for every link on the page I get something like /Products/ProductInfoDisplay.aspx?SiteId=1&amp;Product=8768400131); however, I've been trying to figure out whether there's a way to get the link in the "href" attribute without searching explicitly for class="on". I guess my first question should be whether this is the best way to find this information (class="on" seems too generic and likely to break in the future, although my CSS and HTML skills aren't that good). I've tried numerous combinations of the find, findAll, findAllNext, etc. methods, but I can't quite make it work. This is mostly what I had (I rearranged and changed it numerous times):

productLinks = soup.find('div', attrs={'class' : 'productName'}).find('a', href=True)

If this isn't a good way to do this, how can I get to the <a> tag from the <div class="productName"> tag? Let me know if you need more information.

Thank you.

Well, once you have the <div> element, you can get the <a> subelement by calling find():

productDivs = soup.findAll('div', attrs={'class' : 'productName'})
for div in productDivs:
    print(div.find('a')['href'])
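
For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same div-then-find() approach. It assumes BeautifulSoup 4 (the bs4 package) with the built-in "html.parser", and it uses a trimmed copy of the markup from the question plus a second, made-up product so the loop has more than one item to visit.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question. The second listing is
# hypothetical and only exists so the loop has two divs to walk over.
HTML = """
<div id="productListing1">
  <div class="productName">
    <a class="on" href="/Products/ProductInfoDisplay.aspx?SiteId=1&amp;Product=8768400131">CAPRI SUN - JUICE DRINK - COOLERS VARIETY PACK 6 OZ</a>
  </div>
  <div class="size">40 CT</div>
</div>
<div id="productListing2">
  <div class="productName">
    <a class="on" href="/Products/ProductInfoDisplay.aspx?SiteId=1&amp;Product=0000000002">EXAMPLE PRODUCT (hypothetical)</a>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

hrefs = []
# find_all is the BeautifulSoup 4 spelling of findAll
for div in soup.find_all("div", attrs={"class": "productName"}):
    link = div.find("a")        # first <a> inside this div
    hrefs.append(link["href"])  # the parser decodes &amp; to & in attribute values

for href in hrefs:
    print(href)
```

Note that the search is keyed on the productName div, so it no longer depends on class="on".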

However, since the <a> is directly inside the <div>, you can also reach it as the a attribute of the div:

productDivs = soup.findAll('div', attrs={'class' : 'productName'})
for div in productDivs:
    print(div.a['href'])
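
One thing to watch with attribute access: div.a evaluates to None when a div contains no <a> at all, so indexing it with ['href'] would raise a TypeError. The sketch below shows one possible guard, assuming BeautifulSoup 4. The markup and paths are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second div deliberately has no link in it.
HTML = """
<div class="productName"><a href="/Products/first">First</a></div>
<div class="productName">No link here</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

hrefs = []
for div in soup.find_all("div", class_="productName"):
    link = div.a  # equivalent to div.find("a"); None if the div has no <a>
    if link is not None and link.has_attr("href"):
        hrefs.append(link["href"])

print(hrefs)
```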

Now, if you want to put all the <a> elements in a list, your code above will not work, because find() returns only the first element that matches its criteria. Instead, get the list of divs and take the subelement from each one, for example with a list comprehension:

productLinks = [div.a for div in
                soup.findAll('div', attrs={'class' : 'productName'})]
for link in productLinks:
    print(link['href'])
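
As an alternative to the list comprehension (not something the answer itself uses), BeautifulSoup 4 also accepts CSS selectors through select(), which can express "every <a> with an href inside a productName div" in one call. A small sketch, with invented markup and paths:

```python
from bs4 import BeautifulSoup

# Made-up markup: two product divs plus one unrelated link that must be skipped.
HTML = """
<div class="productName"><a class="on" href="/Products/one">One</a></div>
<div class="productName"><a class="on" href="/Products/two">Two</a></div>
<div class="size"><a href="/not-a-product">Other</a></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# "div.productName a[href]" matches <a> tags that carry an href attribute
# and sit anywhere inside a <div class="productName">.
product_links = soup.select("div.productName a[href]")
hrefs = [link["href"] for link in product_links]

print(hrefs)
```

Unlike div.a, this collects every matching link in each div rather than only the first one.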

Here is a solution using BeautifulSoup 4:

for data in soup.find_all('div', class_='productName'):
    for a in data.find_all('a'):
        print(a.get('href'))  # the link URL
        print(a.text)  # the text inside the link

If you only need the first link, you can avoid those for loops by indexing into the results:

data = soup.find_all('div', class_='productName')
a_class = data[0].find_all('a')
url_ = a_class[0].get('href')
print(url_)
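
Indexing with [0] raises an IndexError when nothing matches. If that can happen on your pages, a small helper built on find() may be safer. This is only a sketch assuming BeautifulSoup 4. The function name first_product_href and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def first_product_href(soup):
    """Return the href of the first link in the first productName div, or None."""
    div = soup.find("div", class_="productName")
    if div is None:
        return None  # no product div on the page
    link = div.find("a", href=True)  # href=True skips <a> tags without an href
    return link["href"] if link is not None else None


# Hypothetical inputs: one page with a product link, one without.
with_link = BeautifulSoup(
    '<div class="productName"><a href="/Products/one">One</a></div>',
    "html.parser",
)
without_link = BeautifulSoup('<div class="size">40 CT</div>', "html.parser")

print(first_product_href(with_link))
print(first_product_href(without_link))
```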

