
Parsing a Reddit search result with BeautifulSoup and Python

Using Python and BeautifulSoup, I'm trying to get the post title and URL from every result of a Reddit search.

Below is part of my code that retrieves all Reddit search results.

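from bs4 import BeautifulSoup

# s and headers_Get are defined earlier in the script (not shown);
# s is assumed to be a requests.Session().
url = 'https://www.reddit.com/search/?q=test'
r = s.get(url, headers=headers_Get)
soup = BeautifulSoup(r.text, "html.parser")
# Each search result is an <a> tag carrying data-click-id="body".
results = soup.find_all('a', attrs={'data-click-id': 'body'})
for result in results:
    print(result.prettify())
    title_post = result.find('h3').text
    url_post = result.find('a')['href']  # this is the line that fails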

soup.find_all('a', attrs={'data-click-id':'body'}) appears to return a list of all search results, which is what I expect.

Printing each result confirms that it contains what I need. Below is the output of print(result.prettify()):

<a class="SQnoC3ObvgnGjWt90zD9Z _2INHSNB8V5eaWp4P0rY_mE" data-click-id="body" href="/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/">
<div class="_2SdHzo12ISmrC8H86TgSCp _1zpZYP8cFNLfLDexPY65Y7" style="--posttitletextcolor:#222222">
<h3 class="_eYtD2XCVieq6emjKBH3m">
<span style="font-weight:normal">Match Thread: 3rd
<em style="font-weight:700">Test
</em>- Australia v India, Day 5
</span>
</h3>
</div>
</a>

title_post = result.find('h3').text extracts the title of the post or comment, and it works as expected.

The problem that I have is with retrieving the address of the post (see href=):

<a class="SQnoC3ObvgnGjWt90zD9Z _2INHSNB8V5eaWp4P0rY_mE" data-click-id="body" href="/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/">

The line url_post = result.find('a')['href'] raises TypeError: 'NoneType' object is not subscriptable.

If I could use the "result" as a string, then I could just look for href within it. Something like:

loc = result.text.find('href=')
print(result.text[loc:])

Obviously, this won't work: result.text does not return the HTML code, only the string "Match Thread: 3rd Test - Australia v India, Day 5".

Question 1: Is there a way to return only the href="" component?

Question 2: Is there a way to convert the soup object "result" into plain text while keeping the HTML components? If that were possible, I'd have an easy workaround.

The href is already in the .attrs of result:

>>> for result in results:
...     print(result.attrs)
...
{'data-click-id': 'body', 'class': ['SQnoC3ObvgnGjWt90zD9Z', '_2INHSNB8V5eaWp4P0rY_mE'], 'href': '/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/'}
...

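Each result is itself the <a> tag, and .find('a') only searches a tag's descendants, so it finds nothing and returns None; subscripting None is what raises the TypeError. So don't call .find(); instead, access the href value with [key] notation, as you would with a dictionary.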

In your example:

for result in results:
    url_post = result["href"]
    print(url_post)

Output:

/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/
/r/Cricket/comments/ku008u/match_thread_3rd_test_australia_v_india_day_4/
/r/Cricket/comments/ktcg7n/match_thread_3rd_test_australia_v_india_day_3/
...
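
To tie the two questions together, here is a minimal, self-contained sketch that runs on a shortened copy of the markup from the question (the class names and inline styles are trimmed for readability, which is my change, not Reddit's output). It shows why find('a') comes back as None, how to read the href, and that str() gives the tag as text with its HTML intact. Joining the relative href with https://www.reddit.com via urljoin is an extra step I'm assuming you'll want; it isn't part of the original code.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Sample markup adapted from the question (classes and styles shortened).
html = """
<a class="SQnoC3ObvgnGjWt90zD9Z" data-click-id="body" href="/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/">
<h3><span>Match Thread: 3rd <em>Test</em> - Australia v India, Day 5</span></h3>
</a>
"""

soup = BeautifulSoup(html, "html.parser")
result = soup.find("a", attrs={"data-click-id": "body"})

# find() only searches descendants, and this <a> contains no nested <a>.
nested = result.find("a")

# Question 1: read the attribute from the tag itself.
# .get() behaves like result["href"] but returns None if the attribute is missing.
href = result.get("href")

# The href is relative, so join it with the site root (an assumed extra step).
full_url = urljoin("https://www.reddit.com", href)

# Question 2: str() serialises the tag with its markup intact.
markup = str(result)

print(nested)
print(full_url)
print("href=" in markup)
```

Since the attribute is available directly, searching the serialised string for href= is rarely needed, but str(result) is there if you want the raw HTML for some other reason.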

You can use PRAW (the Python Reddit API Wrapper) to query Reddit's API, which is much easier than parsing the web pages. The class names in the page's HTML are randomly generated, so you can't rely on them in your selectors.

https://praw.readthedocs.io/en/latest/
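
As a rough sketch of what the PRAW route might look like: the helper names make_reddit and search_posts are hypothetical, the credentials are placeholders you would get by registering your own Reddit app, and searching the "all" subreddit is my assumption for approximating a site-wide search. I haven't run this against the live API, so treat it as a starting point rather than a tested solution.

```python
def make_reddit(client_id, client_secret, user_agent):
    """Build a PRAW client from the credentials of a registered Reddit app."""
    # Imported here so search_posts below can be tried without PRAW installed.
    import praw  # third-party: pip install praw

    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )


def search_posts(reddit, query, limit=10):
    """Return (title, url) pairs for submissions matching the query."""
    posts = []
    for submission in reddit.subreddit("all").search(query, limit=limit):
        # permalink is a site-relative path, so prefix the site root.
        url = "https://www.reddit.com" + submission.permalink
        posts.append((submission.title, url))
    return posts
```

Usage would be along the lines of search_posts(make_reddit("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", "my-script/0.1"), "test"), with the three placeholder strings replaced by your own values.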

The technical posts on this site are licensed under CC BY-SA 4.0. If you reproduce them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 