Scraping with BS4
The code produces an empty file. I may be missing the correct div/tag entries (?). I am trying to scrape multiple pages on one site.
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36 Edg/91.0.864.71'}
questionlist = []

def getQuestions(tag, page):
    url = f'https://www.tradepractitioner.com/tag/{tag}'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    questions = soup.find_all('div', {'class': 'main grid '})
    for item in questions:
        question = {
            'title': item.find('a', {'class': 'post-title'}).text,
            'status': item.find('a', {'class': 'post-content'}).text,
        }
        questionlist.append(question)
    return

for x in range(1,5):
    getQuestions('cfius', x)

df = pd.DataFrame(questionlist)
df.to_excel('stackquestions.xlsx', index=False)
print('End.')
You have a trailing space in the class string, so nothing matches.
Replace:
questions = soup.find_all('div', {'class': 'main grid '})  # <- HERE " '"
With:
questions = soup.find_all('div', {'class': 'main grid'})
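To see why the trailing space matters, here is a minimal sketch using a made-up HTML snippet (the markup is an assumption for illustration, not the real page). When the value given for `class` contains whitespace, BeautifulSoup compares it against the full class attribute string, so `'main grid '` finds nothing. A CSS selector via `select()` is less sensitive to this kind of slip.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page's structure may differ.
SAMPLE_HTML = '<div class="main grid"><p>content</p></div>'

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# A multi-class string is compared against the whole class attribute value,
# so a stray trailing space prevents the match.
with_space = soup.find_all('div', {'class': 'main grid '})
without_space = soup.find_all('div', {'class': 'main grid'})

# A CSS selector tests each class separately, regardless of order.
via_selector = soup.select('div.grid.main')

print(len(with_space), len(without_space), len(via_selector))
```

If the class list on the real page changes order or gains extra classes, the selector form would probably keep working where the exact-string form would not.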
Now you have another problem:
AttributeError: 'NoneType' object has no attribute 'text'
Solution:
questions = soup.find_all('article', {'class': 'post'})
for question in questions:
    question = {'title': question.find('h1', {'class': 'post-title'}).find('a').text,
                'status': question.find('section', {'class': 'post-content'}).find(text=True)}
    questionlist.append(question)
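The AttributeError arises because `find()` returns None when nothing matches, and None has no `.text`. Below is a minimal sketch of a guarded version that can be tried without network access. The helper name `parse_posts` and the sample markup are hypothetical, modeled on the selectors in the solution rather than taken from the live site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the selectors used in the solution.
SAMPLE_HTML = """
<article class="post">
  <h1 class="post-title"><a href="#">First post</a></h1>
  <section class="post-content">Summary of the first post.</section>
</article>
<article class="post">
  <h1 class="post-title">Title without a link</h1>
</article>
"""


def parse_posts(html):
    """Return a list of {'title', 'status'} dicts, skipping incomplete posts."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for article in soup.find_all('article', {'class': 'post'}):
        link = article.select_one('h1.post-title a')
        content = article.find('section', {'class': 'post-content'})
        # find() and select_one() return None when nothing matches,
        # so check before reading any text from the result.
        if link is None or content is None:
            continue
        rows.append({
            'title': link.get_text(strip=True),
            'status': content.get_text(strip=True),
        })
    return rows


posts = parse_posts(SAMPLE_HTML)
print(posts)
```

Here `get_text(strip=True)` collects all the text inside the section, whereas `find(text=True)` in the solution returns only the first text node. Separately, the original `getQuestions` never uses its `page` argument, so each loop iteration requests the same URL. The site's pagination URL pattern is not given here, so it would need to be checked in the browser before adding it to the f-string.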