Web scraping problem: I don't know how to get information from an HTML page into my Python program
I recently started studying web scraping, and today I set myself a challenge: to collect information about every world from tibia.com, such as the world's name, how many people are playing on it, what type of server it is, etc.
I wrote something like this:
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup as soup
from fake_useragent import UserAgent
my_url = 'https://www.tibia.com/community/?subtopic=worlds'
uClient = urlopen(Request(my_url, headers={'User-Agent': 'Mozilla'}))
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("tr", {"class":['Even', 'Odd']})
for container in containers:
    informations = containers.findAll("td")
    world = informations[0].txt
But I don't know how to pull the information out of the td elements. My data file looks like this:
<tr class="Odd">
<td style="width: 150px;"><a href="https://www.tibia.com/community/?subtopic=worlds&world=Cosera">Cosera</a>
</td>
<td style="text-align: right;">75</td>
<td>North America</td>
<td>Optional PvP</td>
This is one of 92 worlds, and what I'm looking for is how to extract the information about the world from this line:
<td style="width: 150px;"><a href="https://www.tibia.com/community/?subtopic=worlds&world=Cosera">Cosera</a>
If you can give me a hint on how to do this, I think I can figure out everything else.
If someone has an idea, I would be grateful for a clue.
I'm not exactly sure what you mean, but I'll try to give a solution to your problem.
It looks like you're trying to get all the row information from the table on the page. The simplest way to do this is to first get all the <tr> elements (all the rows), which you have already done successfully.
Then we want to loop through these rows to extract the data from them.
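As a side note on why the loop in the question does not get that far: it calls findAll on containers (the whole result list) instead of on container (a single row), and it reads .txt, which BeautifulSoup treats as a lookup for a child tag named txt and therefore returns None. Here is a minimal offline sketch of the corrected loop. It assumes the row from the question pasted in as a string; the SAMPLE_HTML name, the surrounding <table> and the closing </tr> are my additions, so treat it as an illustration rather than the real page.

```python
from bs4 import BeautifulSoup

# Row copied from the question; <table> and </tr> were added here
# (an assumption) so that the snippet is a complete fragment.
SAMPLE_HTML = """
<table>
<tr class="Odd">
<td style="width: 150px;"><a href="https://www.tibia.com/community/?subtopic=worlds&world=Cosera">Cosera</a>
</td>
<td style="text-align: right;">75</td>
<td>North America</td>
<td>Optional PvP</td>
</tr>
</table>
"""

page_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
containers = page_soup.find_all("tr", {"class": ["Even", "Odd"]})

worlds = []
for container in containers:
    # Search inside the single row ("container"), not the whole result list
    cells = container.find_all("td")
    # .text (not .txt) gives the cell text; strip() drops the trailing newline
    worlds.append({
        "world": cells[0].text.strip(),
        "online": cells[1].text.strip(),
        "location": cells[2].text.strip(),
        "pvp_type": cells[3].text.strip(),
    })

print(worlds)
```

The sample only has the four columns shown in the question, so the sketch stops at index 3.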
I'm not sure whether you want only the 'Cosera' world or the whole table. If you want the whole table, just remove the if statement in the code below.
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup as soup

my_url = 'https://www.tibia.com/community/?subtopic=worlds'
world_to_find = 'Cosera'

# Download the page, sending a User-Agent header as in the question
uClient = urlopen(Request(my_url, headers={'User-Agent': 'Mozilla'}))
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")
# Each world is a <tr> with class "Odd" or "Even"
all_rows = page_soup.find_all('tr', {"class": ["Odd", "Even"]})

for row in all_rows:
    # Remove this "if" (and un-indent its body) to print every world
    if row.select_one("td").text.strip() == world_to_find:
        data = {}
        # Direct <td> children of the row, one per column
        cells = row.find_all("td", recursive=False)
        data['world'] = cells[0].text.strip()
        data['online'] = cells[1].text.strip()
        data['location'] = cells[2].text.strip()
        data['pvp_type'] = cells[3].text.strip()
        data['additional_info'] = cells[5].text.strip()
        print(data)
Output:
{'world': 'Cosera', 'online': '86', 'location': 'North America', 'pvp_type': 'Optional PvP', 'additional_info': 'blocked'}
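Since the question asks specifically about the first cell, which wraps the world name in a link, here is a small sketch of reading both the name and the link target from that one <td>. The CELL_HTML string is the line from the question with a closing </td> added, and the variable names are only illustrative. Reading the world parameter back out of the URL with urllib.parse is my own addition, not something the page requires.

```python
from urllib.parse import urlparse, parse_qs
from bs4 import BeautifulSoup

# The first cell from the question, hard-coded for illustration
CELL_HTML = (
    '<td style="width: 150px;">'
    '<a href="https://www.tibia.com/community/?subtopic=worlds&world=Cosera">Cosera</a>\n'
    '</td>'
)

cell = BeautifulSoup(CELL_HTML, "html.parser").find("td")
link = cell.find("a")  # the <a> element inside the cell

world_name = link.get_text(strip=True)  # visible text of the link
world_url = link["href"]                # value of the href attribute
# Optional: read the "world" query parameter from the URL
world_param = parse_qs(urlparse(world_url).query)["world"][0]

print(world_name)
print(world_url)
print(world_param)
```

In the loop above, the same calls would apply to the first <td> of each row.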
If this wasn't what you meant, please explain in your post exactly what you want the output to be.