Issue with Tables and Beautiful Soup

I'm trying to extract tags that are nested in a tr tag, but the identifier I'm using to find the correct row is nested in another td within the same tr tag.

That is, I'm using the website LoLKing

and trying to scrape it for statistics based on a name, for example, Ahri.

The HTML is:

<tr>
            <td data-sorttype="string" data-sortval="Ahri" style="text-align: left;">
                <div style="display: table-cell;">
                <div class="champion-list-icon" style="background:url(//lkimg.zamimg.com/shared/riot/images/champions/103_32.png)">
                    <a style="display: inline-block; width: 28px; height: 28px;" href="/champions/ahri"></a>
                </div>
                </div>
                <div style="display: table-cell; vertical-align: middle; padding-top: 3px; padding-left: 5px;"><a href="/champions/ahri">Ahri</a></div>
            </td>
            <td style="text-align: center;"  data-sortval="975"><img src='//lkimg.zamimg.com/images/rp_logo.png' width='18' class='champion-price-icon'>975</td>
            <td style="text-align: center;" data-sortval="6300"><img src='//lkimg.zamimg.com/images/ip_logo.png' width='18' class='champion-price-icon'>6300</td>
            <td style="text-align: center;" data-sortval="10.98">10.98%</td>
            <td style="text-align: center;" data-sortval="48.44">48.44%</td>
            <td style="text-align: center;" data-sortval="18.85">18.85%</td>
            <td style="text-align: center;" data-sorttype="string" data-sortval="Middle Lane">Middle Lane</td>
            <td style="text-align: center;" data-sortval="1323849600">12/14/2011</td>
        </tr> 

I'm having problems extracting the statistics, which are in td tags other than the one that carries the champion's data-sortval. I imagine that I want to pull all the tr tags, but I don't know how to select the tr that contains the td with data-sortval="Ahri". At that point, I would want to step through the cells of that tr x times until I reach the first statistic I want, 10.98.

At the moment, I'm trying to do a find for the td with data-sortval="Ahri", but it doesn't return the rest of the tr.

It might be important to note that all of this is nested inside a larger tag:

  <table class="clientsort champion-list" width="100%" cellspacing="0" cellpadding="0">
    <thead>
    <tr><th>Champion</th><th>RP Cost</th><th>IP Cost</th><th>Popularity</th><th>Win Rate</th><th>Ban Rate</th><th>Meta</th><th>Released</th></tr>     
    </thead>
    <tbody>

I apologize for the lack of clarity; I'm new to this scraping terminology, but I hope that makes enough sense. Right now, I'm also doing:

main = soup.find('table', {'class':'clientsort champion-list'})

to get only that table.

Edit:

I tried this with the name passed in as a variable:

for champ in champs:
    a = str(champ)
    print type(a) is str
    td_name = soup.find('td',{"data-sortval":a})

It confirms that a is a string, but it throws this error:

  File "lolrec.py", line 82, in StatScrape
    tr = td_name.parent
AttributeError: 'NoneType' object has no attribute 'parent'

GO LOL!

For commercial purposes, please read the terms of service before scraping.

(1) To scrape a list of heroes, you can do this, which follows logic similar to what you described.

from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen('http://www.lolking.net/champions/')
soup = BeautifulSoup(html, 'html.parser')
# locate the cell that contains each hero name, e.g. Ahri
hero_list = ["Blitzcrank", "Ahri", "Akali"]
for hero in hero_list:
    td_name = soup.find('td', {"data-sortval": hero})
    # the row is the parent of the name cell
    tr = td_name.parent
    # the fourth direct child cell (index 3) is the Popularity column
    popularity = tr.find_all('td', recursive=False)[3].text
    print(hero, popularity)

Output:

Blitzcrank 12.58%
Ahri 10.98%
Akali 7.52%
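
The AttributeError in the question is what you get when find() matches nothing: it returns None, and None has no .parent. One possible cause, which I can't confirm from the question, is that a name in champs doesn't exactly match the data-sortval value (the match is case-sensitive). Here is a minimal Python 3 sketch that guards against that. The helper name popularity_for and the inline HTML (a trimmed copy of the row from the question) are made up for illustration so the example runs without hitting the site.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the champion table; values copied from the question.
HTML = """
<table class="clientsort champion-list">
  <tbody>
    <tr>
      <td data-sorttype="string" data-sortval="Ahri"><a href="/champions/ahri">Ahri</a></td>
      <td data-sortval="975">975</td>
      <td data-sortval="6300">6300</td>
      <td data-sortval="10.98">10.98%</td>
    </tr>
  </tbody>
</table>
"""


def popularity_for(soup, hero):
    """Return the popularity text for `hero`, or None if no row matches."""
    td_name = soup.find("td", {"data-sortval": hero})
    if td_name is None:
        # find() returns None when nothing matches; calling .parent on it
        # raises the AttributeError shown in the question.
        return None
    row = td_name.find_parent("tr")
    cells = row.find_all("td", recursive=False)
    return cells[3].get_text(strip=True)


soup = BeautifulSoup(HTML, "html.parser")
print(popularity_for(soup, "Ahri"))   # exact match on data-sortval
print(popularity_for(soup, "ahri"))   # wrong case, so no match
```

Printing the names that come back as None is a quick way to see which entries in your list don't line up with the page.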

(2) To scrape all the heroes.

from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen('http://www.lolking.net/champions/')
soup = BeautifulSoup(html, 'html.parser')
# find the table first
table = soup.find('table', {"class": "clientsort champion-list"})
# find all the rows that are direct children of tbody
for row in table.find('tbody').find_all("tr", recursive=False):
    cols = row.find_all("td")
    # first cell is the hero name, fourth cell (index 3) is the popularity
    hero = cols[0].text.strip()
    popularity = cols[3].text
    print(hero, popularity)

Output:

Aatrox 6.86%
Ahri 10.98%
Akali 7.52%
Alistar 4.9%
Amumu 8.75%
...
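
If you want every statistic rather than just popularity, one option is to pair each cell with its column header so you don't have to count indexes. This is only a sketch in Python 3: the function name table_to_stats and the inline HTML (the header row and the Ahri row from the question, trimmed) are made up for illustration, and it assumes every row has one td per th.

```python
from bs4 import BeautifulSoup

# Header row and Ahri row copied (and trimmed) from the question.
HTML = """
<table class="clientsort champion-list">
  <thead>
    <tr><th>Champion</th><th>RP Cost</th><th>IP Cost</th><th>Popularity</th>
        <th>Win Rate</th><th>Ban Rate</th><th>Meta</th><th>Released</th></tr>
  </thead>
  <tbody>
    <tr>
      <td data-sortval="Ahri"><a href="/champions/ahri">Ahri</a></td>
      <td data-sortval="975">975</td>
      <td data-sortval="6300">6300</td>
      <td data-sortval="10.98">10.98%</td>
      <td data-sortval="48.44">48.44%</td>
      <td data-sortval="18.85">18.85%</td>
      <td data-sortval="Middle Lane">Middle Lane</td>
      <td data-sortval="1323849600">12/14/2011</td>
    </tr>
  </tbody>
</table>
"""


def table_to_stats(soup):
    """Map each champion name to a dict of {column header: cell text}."""
    table = soup.find("table", class_="champion-list")
    headers = [th.get_text(strip=True)
               for th in table.find("thead").find_all("th")]
    stats = {}
    for row in table.find("tbody").find_all("tr", recursive=False):
        values = [td.get_text(strip=True)
                  for td in row.find_all("td", recursive=False)]
        record = dict(zip(headers, values))  # assumes one td per th
        stats[record["Champion"]] = record
    return stats


stats = table_to_stats(BeautifulSoup(HTML, "html.parser"))
print(stats["Ahri"]["Win Rate"])
```

With the real page you would pass in the soup built from the downloaded HTML instead of the inline string.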
