
Beautiful Soup and parsing reddit

I've been trying to parse the reddit Showerthoughts page for its submissions and have run into a problem:

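import requests
from bs4 import BeautifulSoup

path = 'https://www.reddit.com/r/Showerthoughts/'

with requests.Session() as s:
    r = s.get(path)
    soup = BeautifulSoup(r.content, "lxml")
    # print(soup.prettify())

    # Every <p> on the page, not only the ones that hold submission titles
    threads = soup.find_all('p')

    for thread in threads:
        # Calling a tag is shorthand for find_all: all <a> tags inside this <p>
        links = thread('a')
        try:
            print(links[0])
        except IndexError:
            # This <p> contains no <a> tag
            pass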

In this code I am trying to get just the title of each submission, which is enclosed in a <p> tag and then an <a> tag with the class "title may-blank". However, the code above returns all the elements that have an <a> tag, of which there are many. Even though the titles are in there, I would have to go through two more iterations of soup.findAll() to reach them, and I am sure there is a less manual way of searching the soup to print all of the titles.

From what I know, I tried to do:

titles = soup.findAll("a", {"class": "title may-blank"})
for title in titles:
    print(title.string)

But this didn't work. Any thoughts? PS: I know this can be done with the reddit API and that it is more efficient, but I want to improve my parsing skills because they are not up to scratch. Thank you for the help.

They are CSS classes, so you can match them with a CSS selector. You also need to add a User-Agent:

import requests
from bs4 import BeautifulSoup

path = 'https://www.reddit.com/r/Showerthoughts/'
# Send a browser-like User-Agent with the request
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36"}

with requests.Session() as s:
    r = s.get(path, headers=headers)
    soup = BeautifulSoup(r.content, "lxml")
    # CSS selector: <a> tags that carry both the "title" and "may-blank" classes
    threads = soup.select('a.title.may-blank')
    for a in threads:
        print(a)
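Printing each tag shows the full <a> element. If only the title text and link are wanted, something along these lines may help. This is a minimal sketch that runs on a made-up HTML fragment rather than the live page: the `sample` markup and its href are hypothetical, and it uses the built-in "html.parser" instead of "lxml" so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like one submission title; not copied from reddit
sample = (
    '<p class="title">'
    '<a class="title may-blank" href="/r/Showerthoughts/comments/abc/">'
    '  A thought  </a></p>'
)

soup = BeautifulSoup(sample, "html.parser")

# Collect (title text, link) pairs from every matching anchor
titles = [
    (a.get_text(strip=True), a.get("href"))
    for a in soup.select("a.title.may-blank")
]

for text, href in titles:
    print(text, href)
```

get_text(strip=True) is used here instead of .string because .string returns None when a tag has more than one child, whereas get_text() joins all the text inside the tag.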

You could also use soup.find_all("a", class_="title"), but that could match more than you want.
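The difference between the three ways of matching can be seen on a small, made-up fragment. This is only a sketch: the markup below is hypothetical and assumes that some title links carry extra classes beyond "title may-blank", which would explain why matching the whole class string can miss them. It uses "html.parser" so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two title links (one with an extra class) and one unrelated link
html = """
<div>
  <p class="title"><a class="title may-blank outbound" href="/1">First thought</a></p>
  <p class="title"><a class="title may-blank" href="/2">Second thought</a></p>
  <a class="title" href="/wiki">Sidebar link</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A string containing a space is compared with the full class attribute value,
# so the link with the extra "outbound" class is missed
exact = soup.find_all("a", {"class": "title may-blank"})

# The CSS selector requires both classes but tolerates additional ones
by_selector = soup.select("a.title.may-blank")

# A single class matches every <a> that has "title", including the sidebar link
by_single_class = soup.find_all("a", class_="title")

print(len(exact), len(by_selector), len(by_single_class))
```

On this fragment the exact-string search finds one link, the selector finds both title links, and the single-class search also picks up the sidebar link.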

