The right approach to using BeautifulSoup in Python 3
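I am trying to build a web scraper in Python using the BeautifulSoup library. I want to get information from all pages of a Bitcoin forum topic. I am using the following code to get the username, status, date and time of the post, post text, activity, and merit from the forum topic https://bitcointalk.org/index.php?topic=2056041.0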
url='https://bitcointalk.org/index.php?topic=2056041.0'
from bs4 import BeautifulSoup
import requests
import re
def get_html(url):
    r = requests.get(url)
    return r.text
html=get_html(url)
soup=BeautifulSoup(html, 'lxml')
results= soup.findAll("td", {"valign" : "top"})
usernames=[]
for i in results:
    x=i.findAll('b')
    try:
        s=str(x[0])
        if 'View the profile of' in s :
            try:
                found = re.search('of (.+?)">', s).group(1)
                if found.isdigit()==False:
                    usernames.append(found)
            except Exception as e :print(e)
    except Exception as e :pass#print(e)
print(len(usernames))
status=[]
for i in results:
    x=i.findAll("div", {"class": "smalltext"})
    s=str(x)
    try:
        found = re.search(' (.+?)<br/>', s).group(1)
        if len(found)<25:
            status.append(found)
    except:pass
print(len(status))
activity=[]
for i in results:
    x=i.findAll("div", {"class": "smalltext"})
    s=str(x)
    try:
        x=s.split('Activity: ')[1]
        x=x.split('<br/>')[0]
        activity.append(x)
    except Exception as e :pass
print(activity)
print(len(activity))
posts=[]
for i in results:
    x=i.findAll("div", {"class": "post"})
    s=str(x)
    try:
        x=s.split('="post">')[1]
        x=x.split('</div>]')[0]
        if x.isdigit()!=True:
            posts.append(x)
    except Exception as e :pass
print(len(posts))
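I feel that wrapping everything in try/except blocks like this is a very ugly and incorrect solution. Is there a more straightforward and elegant solution for this task?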
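You are right. It is ugly.
You say you are trying to scrape with BeautifulSoup, but you hardly use the parsed soup object: every field is extracted by converting a tag back to a string and applying a regex or split to it. If you are going to convert the soup object to a string and parse it with regex, you might as well skip importing BeautifulSoup and run the regex directly on r.text.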
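Using regular expressions to parse HTML is a bad idea, because a pattern written against the serialized markup breaks as soon as attribute order, quoting, or whitespace changes.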
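To make the fragility concrete, here is a minimal sketch. The two HTML snippets are hypothetical (they are not copied from the forum) and describe the same link with the attributes in a different order; the regex is the one from the question, and 'html.parser' is used only so the sketch needs no extra parser installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippets: the same link, only the attribute order differs.
html_a = '<b><a href="/u/1" title="View the profile of alice">alice</a></b>'
html_b = '<b><a title="View the profile of alice" href="/u/1">alice</a></b>'

def username_by_regex(html):
    # Pattern from the question: it relies on `">` directly following the name.
    match = re.search('of (.+?)">', html)
    return match.group(1) if match else None

def username_by_tree(html):
    # Navigate the parsed tree instead of matching the serialized markup.
    soup = BeautifulSoup(html, 'html.parser')
    return soup.b.a.get_text()

for html in (html_a, html_b):
    print(repr(username_by_regex(html)), repr(username_by_tree(html)))
```

For the second snippet the regex captures the name together with the trailing attribute text, while the tree lookup returns the same name for both.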
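You seem to have just discovered that BeautifulSoup can be used to parse HTML, without going through its documentation. Learn how to navigate the HTML tree; the official documentation is more than enough for a simple task like this: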
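# Assumes `soup` and `import re` from the question's code.
usernames = []
statuses = []
activities = []
posts = []
for i in soup.find_all('td', {'class': 'poster_info'}):
    # Skip cells whose info box has no "Activity: N" line.
    # `string=` is the current name of the older `text=` argument.
    j = i.find('div', {'class': 'smalltext'}).find(string=re.compile('Activity'))
    if j:
        usernames.append(i.b.a.text)
        # The first child of the smalltext div is the member status.
        statuses.append(i.find('div', {'class': 'smalltext'}).contents[0].strip())
        activities.append(j.split(':')[1].strip())
        # The post body sits in the next <td> after the poster info cell.
        posts.append(i.find_next('td').find('div', {'class': 'post'}).text.strip())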
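If you want to try the navigation calls used above (.b.a, .contents[0], find(string=...), find_next) without hitting the forum, something like the following sketch should work. The SAMPLE markup is an assumption: a hand-written reduction of one post to the elements the loop reads, not the real page source.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: one forum post reduced to the elements the loop reads.
SAMPLE = """
<table><tr>
  <td class="poster_info">
    <b><a href="#" title="View the profile of alice">alice</a></b>
    <div class="smalltext">Full Member<br/>Activity: 280<br/></div>
  </td>
  <td><div class="post">I own 9 coin types.</div></td>
</tr></table>
"""

soup = BeautifulSoup(SAMPLE, 'html.parser')
info = soup.find('td', class_='poster_info')
small = info.find('div', class_='smalltext')

# Attribute access walks to the first matching descendant tag.
username = info.b.a.get_text()
# .contents lists direct children; the first one is the status text node.
status = small.contents[0].strip()
# find(string=...) returns the matching text node itself.
activity = small.find(string=re.compile('Activity')).split(':')[1].strip()
# find_next searches forward in document order, past the current cell.
post = info.find_next('td').find('div', class_='post').get_text(strip=True)

print(username, status, activity, post, sep=' | ')
```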
Run against the topic page, printing the lengths of the four lists gives:
>>> len(usernames), len(statuses), len(activities), len(posts)
(20, 20, 20, 20)
And this prints the actual content:
for i, j, k, l in zip(usernames, statuses, activities, posts):
    print('{} - {} - {}:\n{}\n'.format(i, j, k, l))
Result:
hous26 - Full Member - 280:
Just curious. Not counting anything less than a dollar in total worth. I own 9 coin types:
satoshforever - Member - 84:
I own three but plan to add three more soon. But is this really a useful question without the size of the holdings?
.
.
.
papajamba - Full Member - 134:
7 coins as of the moment. Thinking of adding xrp again though too. had good profit when it was only 800-900 sats
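Four parallel lists work, but they only stay aligned as long as every append succeeds. One possible variation, sketched here under assumptions, keeps each post's fields together in a single record and skips cells that lack an expected element instead of raising. The Post type, the extract_posts helper, and the SAMPLE markup are all hypothetical names introduced for illustration.

```python
import re
from typing import List, NamedTuple
from bs4 import BeautifulSoup

class Post(NamedTuple):
    # Hypothetical record type: one forum post per instance.
    username: str
    status: str
    activity: str
    text: str

def extract_posts(soup) -> List[Post]:
    records = []
    for info in soup.find_all('td', class_='poster_info'):
        small = info.find('div', class_='smalltext')
        if small is None:
            continue  # no info box in this cell
        activity = small.find(string=re.compile('Activity'))
        body_cell = info.find_next('td')
        body = body_cell.find('div', class_='post') if body_cell else None
        if activity is None or body is None:
            continue  # missing activity line or post body
        if info.b is None or info.b.a is None:
            continue  # no profile link to read the name from
        records.append(Post(
            username=info.b.a.get_text(),
            status=small.contents[0].strip(),
            activity=activity.split(':')[1].strip(),
            text=body.get_text(strip=True),
        ))
    return records

# Hypothetical markup: two complete posts and one cell without an info box.
SAMPLE = """
<table>
<tr><td class="poster_info"><b><a href="#">alice</a></b>
<div class="smalltext">Full Member<br/>Activity: 280<br/></div></td>
<td><div class="post">First post.</div></td></tr>
<tr><td class="poster_info"><b><a href="#">bob</a></b>
<div class="smalltext">Member<br/>Activity: 84<br/></div></td>
<td><div class="post">Second post.</div></td></tr>
<tr><td class="poster_info"><b>guest</b></td><td></td></tr>
</table>
"""

records = extract_posts(BeautifulSoup(SAMPLE, 'html.parser'))
for r in records:
    print('{} - {} - {}:\n{}\n'.format(r.username, r.status, r.activity, r.text))
```

Whether the extra checks are worth it depends on how uniform the real pages are; for a single well-formed topic page the shorter loop above may be all you need.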