Looping through different web pages with Python

Question

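I am currently taking a big data course, but I don't know much about it yet. For an assignment, I want to find out which topics are discussed on the TripAdvisor forum about Amsterdam. I want to create a CSV file containing the topic, the author, and the number of replies for each topic. A few questions: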

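How do I make a list of all the topics? I checked the page source of every page, and the topic always comes right after 'onclick="setPID(34603)' and ends with </a>. I tried 're.findall(r'onclick="setPID(34603)">(.*?)</a>', post)', but it doesn't work.
The replies are not given in the comment section but in a separate row on the page. How can I write a loop that appends all the replies to a new variable?
How do I loop through the first 20 pages? The URL in my code covers only the first page, which shows 20 topics.
Should I create the CSV file before or after the loop?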

Here is my code:

from urllib import request
import re
import csv

topiclist=[]
metalist=[]

# Send a browser-like User-Agent header with the request
req = request.Request(
    'https://www.tripadvisor.com/ShowForum-g188590-i60-Amsterdam_North_Holland_Province.html',
    headers={'User-Agent': 'Mozilla/5.0'})

tekst = request.urlopen(req).read()
# Decode the bytes and flatten newlines and tabs so the regexes can span lines
tekst = tekst.decode(encoding="utf-8", errors="ignore").replace("\n", " ").replace("\t", " ")


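topicsection = re.findall(r'<b><a(.*?)</div>', tekst)

topic = []
for post in topicsection:
    # The unescaped parentheses in setPID(34603) form a regex group, so this
    # pattern cannot match the literal text "setPID(34603)"
    topic.append(re.findall(r'onclick="setPID(34603)">(.*?)</a>', post))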


author = []
for post in topicsection:
    author.append(re.findall(r'<a href="/members-forums/.*?">(.*?)</a>', post))

replies = re.findall(r'<td class="reply rowentry.*?">(.*?)</td>', tekst)

Answer 1

Don't use regular expressions to parse HTML. Use an HTML parser such as BeautifulSoup.

For example:

from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.tripadvisor.com/ShowForum-g188590-i60-Amsterdam_North_Holland_Province.html")
soup = BeautifulSoup(r.content, "html.parser") #or another parser such as lxml
topics = soup.find_all("a", {'onclick': 'setPID(34603)'})
#do stuff
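
For the other two questions (looping over the first 20 pages, and when to create the CSV file), one possible approach is sketched below. It is only a sketch under stated assumptions: the helper names page_urls, collect_rows and write_csv are made up for illustration, the fetch and parse arguments stand in for whatever download and parsing code you use (for example requests plus the find_all call above), and the "-o20-" style offset in the URL is an assumption about how the forum paginates that should be checked against the site's own "next page" links. The rows are gathered in a list during the loop and the CSV is written once afterwards.

```python
import csv
import io

FIRST_PAGE = ("https://www.tripadvisor.com/"
              "ShowForum-g188590-i60-Amsterdam_North_Holland_Province.html")
TOPICS_PER_PAGE = 20  # the question states that each page lists 20 topics


def page_urls(pages):
    """Yield one URL per listing page.

    ASSUMPTION: later pages add an offset segment such as "-o20-" after
    "-i60"; verify this against the real pagination links.
    """
    for page in range(pages):
        offset = page * TOPICS_PER_PAGE
        if offset == 0:
            yield FIRST_PAGE
        else:
            yield FIRST_PAGE.replace("-i60-", "-i60-o%d-" % offset)


def collect_rows(pages, fetch, parse):
    """Loop over the pages and gather (topic, author, replies) rows.

    fetch(url) returns the page HTML and parse(html) returns a list of rows;
    both are hypothetical helpers supplied by the caller.
    """
    rows = []
    for url in page_urls(pages):
        rows.extend(parse(fetch(url)))
    return rows


def write_csv(rows, out):
    """Write the header and all collected rows to an open text file."""
    writer = csv.writer(out)
    writer.writerow(["topic", "author", "replies"])
    writer.writerows(rows)


# Offline demo with stand-in fetch/parse functions (no network access).
demo_rows = collect_rows(2, fetch=lambda url: url,
                         parse=lambda html: [("Example topic", "someone", 3)])
buffer = io.StringIO()
write_csv(demo_rows, buffer)
```

With real helpers, the last step would be something like opening "topics.csv" with newline="" and passing the file object to write_csv in place of the in-memory buffer.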

Solution 1
3 2016-05-15 17:28:15
