
Loop over different webpages using Python

I am currently following a course in Big Data but do not understand much of it yet. For an assignment, I would like to find out which topics are discussed on the TripAdvisor forum about Amsterdam. I want to create a CSV file containing the topic, the author, and the number of replies per topic. Some questions:

  1. How can I make a list of all the topics? I checked the page source for all the pages, and the topic always appears after 'onclick="setPID(34603)' and ends with </a>. I tried 're.findall(r'onclick="setPID(34603)">(.*?)</a>', post)', but it's not working.
  2. The replies are not given in the comment section, but in a separate row on the page. How can I write a loop that appends all the replies to a new variable?
  3. How do I loop over the first 20 pages? The URL in my code only covers the first page, which gives 20 topics.
  4. Do I create the CSV file before or after the loop?

Here is my code:

from urllib import request
import re
import csv

topiclist=[]
metalist=[]

req = request.Request(
    'https://www.tripadvisor.com/ShowForum-g188590-i60-Amsterdam_North_Holland_Province.html',
    headers={'User-Agent': "Mozilla/5.0"})

tekst=request.urlopen(req).read()
tekst=tekst.decode(encoding="utf-8",errors="ignore").replace("\n"," ").replace("\t"," ")


topicsection=re.findall(r'<b><a(.*?)</div>',tekst)

topic=[]
for post in topicsection:
   topic.append(re.findall(r'onclick="setPID(34603)">(.*?)</a>', post))


author=[]
for post in topicsection: 
   author.append(re.findall(r'<a href="/members-forums/.*?">(.*?)</a>', post))

replies=re.findall(r'<td class="reply rowentry.*?">(.*?)</td>',tekst)

Don't use regular expressions to parse HTML. Use an HTML parser such as BeautifulSoup.
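As an aside on why the pattern in the question returns nothing: in a regular expression, unescaped parentheses form a group, so `setPID(34603)` looks for the text `setPID34603` rather than the literal `setPID(34603)`. A minimal sketch of the difference, using a made-up HTML snippet (the link target and topic title are assumptions, not copied from the real page):

```python
import re

# Hypothetical snippet; the real TripAdvisor markup may differ.
html = ('<b><a href="/ShowTopic-example.html" '
        'onclick="setPID(34603)">Canal cruise tips</a></b>')

# Unescaped: "(34603)" is treated as a group, so the literal
# parentheses in the page source are never matched.
broken = re.findall(r'onclick="setPID(34603)">(.*?)</a>', html)

# Escaped: "\(" and "\)" match the literal parentheses.
fixed = re.findall(r'onclick="setPID\(34603\)">(.*?)</a>', html)

print(broken)
print(fixed)
```

This only explains the failure. A parser remains the sturdier option, because it does not depend on attribute order or whitespace in the markup.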

For example:

from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.tripadvisor.com/ShowForum-g188590-i60-Amsterdam_North_Holland_Province.html")
soup = BeautifulSoup(r.content, "html.parser") #or another parser such as lxml
topics = soup.find_all("a", {'onclick': 'setPID(34603)'})
#do stuff
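The remaining questions (replies, looping over pages, and the CSV file) could be handled along the following lines. This is a rough sketch under several assumptions, not a tested scraper: it assumes each topic sits in its own `<tr>` row, that later pages insert an `-o<offset>-` segment into the URL, and that the selectors from the question (`onclick="setPID(34603)"`, `/members-forums/` links, `reply rowentry` cells) still match the page. The helper names `page_urls`, `parse_rows`, and `write_csv` and the `SAMPLE` markup are made up for illustration, and the demo runs offline on `SAMPLE` instead of fetching anything.

```python
import csv
import io

from bs4 import BeautifulSoup

# Assumed URL template: "{}" is empty for page 1, "o20-" for page 2, etc.
BASE = ("https://www.tripadvisor.com/ShowForum-g188590-i60-"
        "{}Amsterdam_North_Holland_Province.html")


def page_urls(pages=20, per_page=20):
    """Build one URL per page; the "-o<offset>-" segment is an assumption."""
    urls = []
    for page in range(pages):
        offset = page * per_page
        urls.append(BASE.format(f"o{offset}-" if offset else ""))
    return urls


def parse_rows(html):
    """Return (topic, author, replies) tuples found in one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        topic = tr.find("a", onclick="setPID(34603)")
        if topic is None:
            continue  # header row or a row without a topic link
        author = tr.find(
            "a", href=lambda h: h and h.startswith("/members-forums/"))
        replies = tr.find("td", class_="reply")
        rows.append((
            topic.get_text(strip=True),
            author.get_text(strip=True) if author else "",
            replies.get_text(strip=True) if replies else "",
        ))
    return rows


def write_csv(rows, handle):
    """Write a header line followed by one line per topic."""
    writer = csv.writer(handle)
    writer.writerow(["topic", "author", "replies"])
    writer.writerows(rows)


# Made-up markup standing in for one downloaded page.
SAMPLE = """
<table>
<tr><th>Topic</th><th>Replies</th></tr>
<tr>
  <td><b><a href="/ShowTopic-x.html" onclick="setPID(34603)">Canal cruise tips</a></b>
      <div>by <a href="/members-forums/alice">alice</a></div></td>
  <td class="reply rowentry">12</td>
</tr>
</table>
"""

all_rows = []
for page_html in [SAMPLE]:  # real use: fetch each URL from page_urls()
    all_rows.extend(parse_rows(page_html))

buffer = io.StringIO()  # real use: open("topics.csv", "w", newline="")
write_csv(all_rows, buffer)
print(buffer.getvalue())
```

On the fourth question, either order works. Here the rows are collected inside the loop and the file is written once afterwards, which keeps the parsing and the output separate. For a live run, each URL would be fetched with something like `requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text`, ideally with a short pause between requests.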

