
Editing Strings and Scraping Data Using Python and BeautifulSoup

I'm trying to scrape the number of registrants for each division in the upcoming World Brazilian Jiu Jitsu Championships using Python and Beautiful Soup. (The end goal is to plot the number of competitors in each division, so the name of each registrant is immaterial.)

I'm able to pull some of the information I want using BeautifulSoup, but I can't seem to isolate exactly what I need.

from urllib.request import urlopen  # Python 3 replacement for urllib2

from bs4 import BeautifulSoup
import fileinput
import matplotlib

# Download the public registration page and parse it
page = urlopen("https://www.ibjjfdb.com/ChampionshipResults/926/PublicRegistrations?lang=en-US").read()
soup = BeautifulSoup(page, 'html.parser')

My original plan was to go through all the data, count the names of the people registered in each division, save the result as a data set, and then plot it by weight, age, etc. I now see that there's a counter at the bottom of each division, so all I really need to do is identify the weight division and the total.

Originally I tried using soup.find_all() to pull the information I wanted, but it gives me a lot of extra content. I started by trying to pull the whole table using a command I took from another StackOverflow question:

result = soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ["row"])

which gives one long string containing what I want:

[<div class="row">\n<h4>BLUE / Juvenile 1 / Male / Rooster</h4>\n<table class="table table-striped">\n<tr>\n<td style="width:40%;">\r\n                                    Atos Jiu-Jitsu\r\n [...]

To narrow it down, I searched for the headings that give just the weight classes:

all_divisions = [division for division in soup.find_all("h4")]

which works well and gives:

[<h4>BLUE / Juvenile 1 / Male / Rooster</h4>, <h4>BLUE / Juvenile 1 / Male / Light-Feather</h4>, <h4>BLUE / Juvenile 1 / Male / Feather</h4>, <h4>BLUE / Juvenile 1 / Male / Light</h4>, [...]

I want to be able to divide these into their separate groups so I can plot them. How can I split them up? I tried using all_divisions[0].split(), but apparently soup.find_all() returns NoneType objects instead of strings.
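A note on the error described above: find_all() returns Tag objects rather than None, but a Tag treats an unknown attribute such as .split as a lookup for a child tag of that name, which yields None, so calling it fails. The following is a minimal sketch, not a definitive fix. It assumes headings shaped like the output quoted above, and the sample HTML string is invented for illustration. It converts each heading to text before splitting:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup that mirrors the headings quoted above.
sample_html = """
<h4>BLUE / Juvenile 1 / Male / Rooster</h4>
<h4>BLUE / Juvenile 1 / Male / Light-Feather</h4>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# get_text() turns each Tag into a plain string, which can then be split
# on "/" and stripped of the surrounding spaces.
all_divisions = [
    [part.strip() for part in heading.get_text().split("/")]
    for heading in soup.find_all("h4")
]

print(all_divisions)
```

Each inner list then holds the four fields of one heading, which makes it straightforward to group by any one of them later.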

Next, I'm trying to get the total. I can isolate the string "Total:" using soup.find_all('strong'), but I can't seem to get the actual number, since it is not included in the bold text:

<td colspan="2">
  <strong>Total:</strong> 8
</td>

I tried to go one level out and pick up that cell of the table using soup.find_all(lambda tag: tag.name == 'td' and tag.get('colspan') == ["2"]), but it doesn't return anything when I do that. How do I pick up the total number of competitors?

(Since the divisions and totals are listed together and in order, I don't think I need to worry about matching each total to its division, as long as I'm careful about the ordering.)
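A likely reason the colspan filter above returns nothing: Beautiful Soup stores only multi-valued attributes such as class as lists, so tag.get('colspan') is the string "2" and never equals ["2"]. The sketch below assumes each totals cell looks like the snippet quoted above (the sample markup is invented) and reads the text node that follows the strong tag:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the totals cell quoted above.
sample_html = """
<table>
  <tr><td colspan="2"><strong>Total:</strong> 8</td></tr>
  <tr><td colspan="2"><strong>Total:</strong> 12</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

totals = []
# colspan is a single-valued attribute, so it is matched against a string.
for cell in soup.find_all("td", attrs={"colspan": "2"}):
    label = cell.find("strong")
    sibling = label.next_sibling if label is not None else None
    # The number lives in the text node right after </strong>.
    if isinstance(sibling, str) and sibling.strip():
        totals.append(int(sibling.strip()))

print(totals)
```

Matching on the attribute value directly avoids the lambda, and next_sibling reaches the number even though it sits outside the bold element.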

You can use a loop to get the values you need.

# Segregate data by rows
rows = soup.find_all('div', {'class': 'row'})

# Create empty lists
divisions = []
competitors = []

# Fill both lists in one pass over the rows
for row in rows:
    # Skip layout rows that lack a division heading or a totals footer
    if row.h4 is None or row.tfoot is None:
        continue

    raw_division = row.h4.text
    division = [part.strip() for part in raw_division.split('/')]
    divisions.append(division)

    raw_competitors = row.tfoot.tr.td.text.strip()
    competitor = raw_competitors.replace('Total: ', '')
    competitors.append(competitor)

You should now have a list of lists (divisions) as well as a flat list (competitors).
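Since the stated goal is a plot of competitors per division, the two lists can be paired and summed by one field. This is a sketch using only the standard library. The sample values are invented, and treating the last field of each heading as the weight class is an assumption based on the headings quoted in the question:

```python
from collections import Counter

# Invented sample data in the shape of the divisions and competitors lists.
divisions = [
    ["BLUE", "Juvenile 1", "Male", "Rooster"],
    ["BLUE", "Juvenile 1", "Male", "Light-Feather"],
    ["BLUE", "Juvenile 1", "Female", "Rooster"],
]
competitors = ["8", "12", "3"]

# Sum the registrant counts per weight class (assumed to be the last field).
per_weight = Counter()
for division, count in zip(divisions, competitors):
    per_weight[division[-1].strip()] += int(count)

print(dict(per_weight))
```

The resulting mapping can be handed to a bar chart, for example matplotlib.pyplot.bar(list(per_weight.keys()), list(per_weight.values())).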

