Scraping specific airport codes from a page using BeautifulSoup

This is my first post; please feel free to let me know how I could post better, and thanks in advance for the help.

I am learning how to scrape data from webpages with Python using BeautifulSoup, and am having difficulty scraping all the airports where LoungeBuddy operates.

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.loungebuddy.com/select/locations')
soup = BeautifulSoup(page.text, 'html.parser')
airport_code_html_lines = soup.find_all(attrs={'class': 'aiprt-code'})

This gets me very close, but I have extraneous data. The result I want is the second line of each result printed by:

    for airport_code in airport_code_html_lines:
        print(airport_code.prettify())

I'm trying to adapt the very simple example here:

https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe

where the author pulls out the price portion. However, when I try to do the equivalent of

price = price_box.text

I get this error:

AttributeError: ResultSet object has no attribute 'txt'. You're probably 
treating a list of items like a single item. Did you call find_all() when you 
meant to call find()?

Python guessed correctly: I am using find_all, but I don't know how else to proceed.

I have tried using different print calls like

print(airport_code.strip('>'))

to see whether I could strip or isolate the code by creating new variables or with creative print commands, but I get this:

TypeError: 'NoneType' object is not callable

I would love either some direction on what to try next (I am considering changing the find_all to a find and then creating a for loop, but that's intimidating to me, and I am hoping for a cleaner solution), or working code that will produce my desired result. I hope to learn Python both through this project and in the future, so any comments on my thought process are appreciated.

Thanks again

Simply replacing print(airport_code.prettify()) with print(airport_code.text) will give you the output you want.
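The error in the question comes from reading .text on the ResultSet that find_all() returns, rather than on the individual elements inside it. Here is a minimal sketch of the difference, using a small made-up HTML fragment (the markup is an assumption for illustration, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<ul>
  <li><span class="aiprt-code">BNE</span></li>
  <li><span class="aiprt-code">SYD</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

spans = soup.find_all("span", class_="aiprt-code")  # list-like ResultSet of Tags
first = soup.find("span", class_="aiprt-code")      # a single Tag (or None if absent)

# .text is available on each Tag, not on the ResultSet as a whole.
codes = [span.text for span in spans]
print(codes)       # ['BNE', 'SYD']
print(first.text)  # BNE
```

Note that .text returns a plain str, so string methods such as strip() can be applied to it; calling them directly on a Tag does not work the same way.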

Try the following code (made it a bit cleaner):

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.loungebuddy.com/select/locations')
soup = BeautifulSoup(page.text, 'html.parser')

# Each matching <span> holds one airport code
for code in soup.find_all('span', class_='aiprt-code'):
    print(code.text)

You can also use soup.find_all('span', {'class': 'aiprt-code'}) instead of soup.find_all('span', class_='aiprt-code'). It's the same thing.

Output:

BNE
SYD
BGI
BRU
...
...
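To check that the two find_all spellings mentioned above select the same elements, here is a small sketch on a made-up fragment (the HTML is hypothetical). It also shows select() with a CSS selector, which is a third way to express the same query:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class name comes from the original page.
html = (
    '<p><span class="aiprt-code">BGI</span>'
    '<span class="other">x</span>'
    '<span class="aiprt-code">BRU</span></p>'
)
soup = BeautifulSoup(html, "html.parser")

by_keyword = [t.text for t in soup.find_all("span", class_="aiprt-code")]
by_dict = [t.text for t in soup.find_all("span", {"class": "aiprt-code"})]
by_css = [t.text for t in soup.select("span.aiprt-code")]

# All three queries return the same spans in document order.
same = by_keyword == by_dict == by_css
print(by_keyword, same)
```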

If you want the airport codes in a list, you can use a list comprehension as shown below. A list makes the data easier to store, use and modify.

airport_codes = [x.text for x in soup.find_all('span', class_='aiprt-code')]
print(airport_codes)

Output:

['BNE', 'SYD', 'BGI', 'BRU', 'GIG', 'SOF', 'PNH', 'REP', ... ]
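Once the codes are in a list, ordinary Python tools apply. As a sketch of that, the following strips stray whitespace with get_text(strip=True) and removes duplicates; the fragment is invented for illustration, and whether the real page contains whitespace or repeated codes is an assumption:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with padding and a repeated code.
html = """
<div>
  <span class="aiprt-code"> SYD </span>
  <span class="aiprt-code">BNE</span>
  <span class="aiprt-code">SYD</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) trims surrounding whitespace from each code.
raw_codes = [
    tag.get_text(strip=True)
    for tag in soup.find_all("span", class_="aiprt-code")
]

# A set drops duplicates; sorted() gives a stable, alphabetical list.
unique_codes = sorted(set(raw_codes))
print(unique_codes)  # ['BNE', 'SYD']
```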

If you want the airport code along with its country name, you can try the following:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.loungebuddy.com/select/locations')
soup = BeautifulSoup(page.text, 'html.parser')

# Map each country name to the first airport code listed under it
# (select_one returns only the first match).
airport_code = {
    item.select_one("h2").text: item.select_one(".aiprt-code").text
    for item in soup.select(".country")
}
print(airport_code)

Partial output:

{'India': 'BLR', 'Poland': 'KTW', 'Thailand': 'BKK', 'Croatia': 'ZAG',--so on--}
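Because select_one() returns only the first match, the dictionary above keeps a single code per country. If a country block lists several airports, a variant that collects all of them could look like the sketch below. The HTML is a made-up fragment that mirrors the selectors used above (.country, h2, .aiprt-code); the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the selectors from the answer.
html = """
<div class="country">
  <h2>Australia</h2>
  <span class="aiprt-code">BNE</span>
  <span class="aiprt-code">SYD</span>
</div>
<div class="country">
  <h2>Belgium</h2>
  <span class="aiprt-code">BRU</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Map each country name to the list of every code inside its block.
codes_by_country = {
    block.select_one("h2").get_text(strip=True): [
        span.get_text(strip=True) for span in block.select(".aiprt-code")
    ]
    for block in soup.select(".country")
}
print(codes_by_country)
# {'Australia': ['BNE', 'SYD'], 'Belgium': ['BRU']}
```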
