Beautiful Soup and parsing reddit

Question

I've been trying to parse the reddit Showerthoughts subreddit for its submissions and have run into a problem:

import requests
from bs4 import BeautifulSoup

path = 'https://www.reddit.com/r/Showerthoughts/'

with requests.Session() as s:
    r = s.get(path)
    soup = BeautifulSoup(r.content, "lxml")

    # print(soup.prettify())

    threads = soup.find_all('p')

    for thread in threads:
        # look for links inside each <p> tag
        links = thread('a')
        try:
            print(links[0])
        except IndexError:
            pass

In this code I am trying to get just the title of each submission, which is enclosed in a <p> tag and then an <a> tag with class "title may-blank". But the code above returns every element with an <a> tag, of which there are many, and even though the titles are in there I would have to go through two more iterations of soup.findAll(). I am sure there is a less manual way of searching through the soup to print all of the titles.

From what I know, I tried:

titles = soup.findAll("a", {"class": "title may-blank"})
for title in titles:
    print(title.string)
But this didn't work. Any thoughts? PS: I know this can be done with the reddit API, which would be more efficient, but I want to improve my parsing skills because they are not up to scratch. Thank you for the help.

Answer 1

They are CSS classes. You also need to add a User-Agent header:

import requests
from bs4 import BeautifulSoup

path = 'https://www.reddit.com/r/Showerthoughts/'
# send a browser-like User-Agent with the request
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36"}

with requests.Session() as s:
    r = s.get(path, headers=headers)
    soup = BeautifulSoup(r.content, "lxml")
    # CSS selector: <a> tags carrying both the "title" and "may-blank" classes
    threads = soup.select('a.title.may-blank')
    for a in threads:
        print(a)
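
To see why treating them as CSS classes matters, here is a minimal offline sketch. The HTML is made up for illustration (it is not reddit's real markup, and the extra "outbound" class is an assumption), and it uses the built-in "html.parser" so it runs without lxml. It compares the dict lookup from the question with the selector above, and pulls out just the title text with get_text():

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a listing page; not reddit's real HTML.
SAMPLE_HTML = """
<div>
  <p class="title"><a class="title may-blank outbound" href="/r/x/1">First thought</a></p>
  <p class="title"><a class="title may-blank" href="/r/x/2">Second thought</a></p>
  <p class="tagline"><a class="author may-blank" href="/u/someone">someone</a></p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A value containing a space is compared against the whole class attribute,
# so the link that carries the extra "outbound" class is skipped.
exact = soup.find_all("a", {"class": "title may-blank"})

# A CSS selector requires each listed class but tolerates additional ones.
by_selector = soup.select("a.title.may-blank")

# get_text() returns only the link text rather than the whole tag.
exact_titles = [a.get_text(strip=True) for a in exact]
selector_titles = [a.get_text(strip=True) for a in by_selector]

print(exact_titles)
print(selector_titles)
```

If the class attribute on the live page ever carries more than those two classes, the selector form keeps working while the exact-string form silently returns fewer results.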

You could also use soup.find_all("a", class_="title"), but that could match more than you want.
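
As a rough illustration of that over-matching, the sketch below runs both lookups on a small invented snippet. The "Sidebar heading" link is a hypothetical stand-in for any other link on the page that happens to have the "title" class:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page differs.
SAMPLE_HTML = """
<a class="title may-blank" href="/r/x/1">A submission title</a>
<a class="title" href="/wiki">Sidebar heading</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_="title" matches any <a> that has "title" among its classes.
broad = [a.get_text() for a in soup.find_all("a", class_="title")]

# The selector additionally requires "may-blank", so it is narrower.
narrow = [a.get_text() for a in soup.select("a.title.may-blank")]

print(broad)
print(narrow)
```

Whether the broader match is a problem depends on what else on the page uses the "title" class, so it is worth checking the output against the real HTML.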

Answer 1: 2 votes, accepted, 2016-07-10 16:45:09
