
Is there a way to remove excess spacing in Python code?

My code below gets the street address for each gym, but the spacing in the output for the gym's opening hours is wrong. Any ideas where I went wrong?

from urllib import parse as urlparse  # Python 3 location of the old urlparse module

from bs4 import BeautifulSoup
from bs4 import Tag
import requests
import time
import csv

sitemap = 'https://www.planetfitness.com/sitemap'
sitemap_content = requests.get(sitemap).content
soup = BeautifulSoup(sitemap_content, 'html.parser')

atags = soup.select('td[class~=club-title] > a[href^="/gyms"]')
links = [atag.get('href') for atag in atags]

with open('gyms.csv', 'w') as gf:
    gymwriter = csv.writer(gf)
    for link in links:
        gymurl = urlparse.urljoin(sitemap, link)
        sitemap_content = requests.get(gymurl).content
        soup = BeautifulSoup(sitemap_content, 'html.parser')
        gymrow = [ gymurl ]

        address_line1 = soup.select('p[class~=address] > span[class~=address-line1]')
        gymrow.append(address_line1[0].text)
        locality = soup.select('p[class~=address] > span[class~=locality]')
        gymrow.append(locality[0].text)
        administrative_area = soup.select('p[class~=address] > span[class~=administrative-area]')
        gymrow.append(administrative_area[0].text)
        postal_code = soup.select('p[class~=address] > span[class~=postal-code]')
        gymrow.append(postal_code[0].text)
        country = soup.select('p[class~=address] > span[class~=country]')
        gymrow.append(country[0].text)

        strongs = soup.select('div > strong')
        for strong in strongs:
            if strong.text == 'Club Hours':
                for sibling in strong.next_siblings:
                    if isinstance(sibling, Tag):
                        hours = sibling.text
                        gymrow.append(hours)
                        break
        print(gymrow)
        gymwriter.writerow(gymrow)
        time.sleep(3)

Thank you for your help!

You want to select the a elements inside the td elements of class club-title and extract their href attribute. The hours text contains newlines, which the code below replaces with ', ' before writing each row.

from bs4 import BeautifulSoup
from bs4 import Tag
import requests
import urllib.parse
import time
import csv

sitemap = 'https://www.planetfitness.com/sitemap'
res = requests.get(sitemap).content
soup = BeautifulSoup(res, 'html.parser')

# The rows in the table of gyms are formatted like so:
# <tr>
# <td class="club-title"><a href="/gyms/albertville-al"><strong>Albertville, AL</strong> <p>5850 US Hwy 431</p></a></td>
# <td class="club-join"><div class="button"><a href="/gyms/albertville-al/offers" title="Join Albertville, AL">Join Now</a></div></td>
# </tr>

# This will find all the links to all the gyms.
atags = soup.select('td[class~=club-title] > a[href^="/gyms"]')
links = [atag.get('href') for atag in atags]

with open('gyms.csv', 'w', newline='') as gf:  # newline='' is what the csv module docs recommend
    gymwriter = csv.writer(gf)
    for link in links:
        # Follow the link to this gym
        gymurl = urllib.parse.urljoin(sitemap, link)
        res = requests.get(gymurl).content
        soup = BeautifulSoup(res, 'html.parser')
        gymrow = [ gymurl ]

        # The address of this gym.
        address_line1 = soup.select('p[class~=address] > span[class~=address-line1]')
        gymrow.append(address_line1[0].text)
        locality = soup.select('p[class~=address] > span[class~=locality]')
        gymrow.append(locality[0].text)
        administrative_area = soup.select('p[class~=address] > span[class~=administrative-area]')
        gymrow.append(administrative_area[0].text)
        postal_code = soup.select('p[class~=address] > span[class~=postal-code]')
        gymrow.append(postal_code[0].text)
        country = soup.select('p[class~=address] > span[class~=country]')
        gymrow.append(country[0].text)

        # The hours of this gym.
        strongs = soup.select('div > strong')
        for strong in strongs:
            if strong.text == 'Club Hours':
                for sibling in strong.next_siblings:
                    if isinstance(sibling, Tag):
                        hours = sibling.text
                        gymrow.append(hours.replace('<br>', '').replace('\n', ', '))
                        break

        gymwriter.writerow(gymrow)
        time.sleep(3)

When I run this, I get:

$ more gyms.csv

https://www.planetfitness.com/gyms/albertville-al,5850 US Hwy 431,Albertville,AL,35950,United States,"Monday-Friday 6am-9pm, Saturday-Sunday 7am-7pm"
https://www.planetfitness.com/gyms/alexander-city-al,987 Market Place,Alexander City,AL,35010,United States,Convenient hours when we reopen
https://www.planetfitness.com/gyms/bessemer-al,528 W Town Plaza,Bessemer,AL,35020,United States,Convenient hours when we reopen
https://www.planetfitness.com/gyms/birmingham-crestline-al,4500 Montevallo Rd,Birmingham,AL,35210,United States,Convenient hours when we reopen
.
.
.
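The stray spacing in the hours column comes from newlines and runs of spaces inside the scraped text. As a minimal sketch (the helper name normalize_hours and the sample string are illustrative assumptions, not taken from the site), the text can be split into lines, each line collapsed to single spaces, and the non-empty lines rejoined:

```python
def normalize_hours(raw, joiner=", "):
    """Collapse whitespace inside each line and join the non-empty lines."""
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return joiner.join(line for line in lines if line)


# Hypothetical sample resembling the text of the hours element.
sample = "\n  Monday-Friday   6am-9pm\n\n  Saturday-Sunday 7am-7pm \n"
print(normalize_hours(sample))
```

In the loop above, gymrow.append(normalize_hours(sibling.text)) could stand in for the chained replace calls; this variant also drops blank lines and leading or trailing spaces, which the replace calls leave in place.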

To debug this, start by printing the value of atags. You are searching for all a tags with the class clubs-list, and none exist. The a tags do not have a class, but their parent td has the class club-title.
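That check can be reproduced offline. The sketch below assumes the sitemap rows look like the one quoted earlier in this thread and parses that single row from a string instead of fetching the page:

```python
from bs4 import BeautifulSoup

# One table row, copied from the sitemap markup quoted earlier in this thread.
html = """
<table><tr>
<td class="club-title"><a href="/gyms/albertville-al"><strong>Albertville, AL</strong> <p>5850 US Hwy 431</p></a></td>
<td class="club-join"><div class="button"><a href="/gyms/albertville-al/offers" title="Join Albertville, AL">Join Now</a></div></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# No <a> carries the class "clubs-list", so this search comes back empty.
by_class = soup.find_all("a", class_="clubs-list")
print(by_class)

# Going through the parent <td class="club-title"> finds the gym link.
links = [a["href"] for a in soup.select("td.club-title > a")]
print(links)
```

Printing both results side by side makes it clear that the selector, not the request, is what returns nothing.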

You can try something like this.

from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.planetfitness.com/sitemap").content
soup = BeautifulSoup(res, 'html.parser')

tds = soup.find_all('td', {'class': 'club-title'})
links = [td.find('a')['href'] for td in tds]
keywords = ['gyms']

for link in links:
    if any(keyword in link for keyword in keywords):
        print(link)

This will get every link and address on that page. If you want more information on each club, it looks like you'll have to loop over the links and load each page.

from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.planetfitness.com/sitemap").content
soup = BeautifulSoup(res, 'html.parser')

atags = soup.find_all('td', {'class':'club-title'})

links = [(atag.find('a')['href'], atag.find('p').text) for atag in atags]


for link in links:
    print(link)
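The hrefs collected this way are relative, so they need to be resolved against the site before each gym page can be loaded. A small sketch of that step, using two paths that appear elsewhere in this thread and leaving the actual requests out:

```python
from urllib.parse import urljoin

sitemap = "https://www.planetfitness.com/sitemap"

# Relative hrefs of the kind found in the sitemap table.
hrefs = ["/gyms/albertville-al", "/gyms/alexander-city-al"]

# urljoin resolves each path against the sitemap URL's scheme and host.
gym_urls = [urljoin(sitemap, href) for href in hrefs]
for url in gym_urls:
    print(url)
```

Each resulting URL can then be passed to requests.get inside a loop, ideally with a pause between requests as the earlier answer does with time.sleep.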
