Using Python and BeautifulSoup, I'm trying to get the post title and URL from every result returned by a Reddit search.
Below is part of my code that retrieves all Reddit search results.
url = 'https://www.reddit.com/search/?q=test'
r = s.get(url, headers=headers_Get)
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all('a', attrs={'data-click-id':'body'})
for result in results:
    print(result.prettify())
    title_post = result.find('h3').text
    url_post = result.find('a')['href']
soup.find_all('a', attrs={'data-click-id':'body'}) appears to return a list of all search results, which is what I expected. By doing print(result), I can validate that it is returning what I need. Below is the result of print(result.prettify()):
<a class="SQnoC3ObvgnGjWt90zD9Z _2INHSNB8V5eaWp4P0rY_mE" data-click-id="body" href="/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/">
<div class="_2SdHzo12ISmrC8H86TgSCp _1zpZYP8cFNLfLDexPY65Y7" style="--posttitletextcolor:#222222">
<h3 class="_eYtD2XCVieq6emjKBH3m">
<span style="font-weight:normal">Match Thread: 3rd
<em style="font-weight:700">Test
</em>- Australia v India, Day 5
</span>
</h3>
</div>
</a>
title_post = result.find('h3').text extracts the title of the post. This works as expected.
The problem that I have is with retrieving the address of the post (see href=):
<a class="SQnoC3ObvgnGjWt90zD9Z _2INHSNB8V5eaWp4P0rY_mE" data-click-id="body" href="/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/">
The line url_post = result.find('a')['href'] raises the error TypeError: 'NoneType' object is not subscriptable.
If I could use the "result" as a string, then I could just look for href within it. Something like:
loc = result.text.find('href=')
print(result.text[loc:])
Obviously, this won't work: result.text does not return the HTML code, only the string "Match Thread: 3rd Test - Australia v India, Day 5".
Question 1: Is there a way to return only the href="" component?
Question 2: Is there a way to convert the soup object "result" into plain text while keeping the HTML components? If that were possible, I'd have an easy workaround.
The href is already in the .attrs of result:
>>> for result in results:
...     print(result.attrs)
...
{'data-click-id': 'body', 'class': ['SQnoC3ObvgnGjWt90zD9Z', '_2INHSNB8V5eaWp4P0rY_mE'], 'href': '/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/'}
...
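Each result is already the <a> tag itself, so result.find('a') searches its children, finds no nested <a>, and returns None. Don't call the .find() method; instead, access the href value using the [key] notation (like a dictionary).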
In your example:
for result in results:
    url_post = result["href"]
    print(url_post)
Output:
/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/
/r/Cricket/comments/ku008u/match_thread_3rd_test_australia_v_india_day_4/
/r/Cricket/comments/ktcg7n/match_thread_3rd_test_australia_v_india_day_3/
...
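To also cover Question 2: calling str() on a tag gives back its HTML markup as a string, so the string-search workaround is possible, though reading the attribute directly is simpler. Below is a minimal, self-contained sketch that uses a trimmed copy of the HTML from the question instead of a live request. The variable names url_safe, html_text and full_url are only illustrative, and building an absolute URL with urljoin is an assumption about what you want to do with the relative path.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Trimmed copy of one search result from the question (no network request).
html = """
<a class="SQnoC3ObvgnGjWt90zD9Z" data-click-id="body" href="/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/">
  <h3><span>Match Thread: 3rd <em>Test</em> - Australia v India, Day 5</span></h3>
</a>
"""

soup = BeautifulSoup(html, "html.parser")
result = soup.find("a", attrs={"data-click-id": "body"})

# result is the <a> tag itself, so searching inside it for another <a> finds nothing.
nested = result.find("a")  # None, which is why ['href'] raised TypeError

# Question 1: read the attribute straight from the tag.
url_post = result["href"]      # raises KeyError if the attribute is missing
url_safe = result.get("href")  # returns None if the attribute is missing

# Title with whitespace normalised across the nested <span>/<em> tags.
title_post = result.find("h3").get_text(" ", strip=True)

# Question 2: str() keeps the HTML markup, unlike .text.
html_text = str(result)

# The href is relative; join it with the site root to get an absolute URL.
full_url = urljoin("https://www.reddit.com", url_post)

print(title_post)
print(full_url)
```

Using result.get("href") is the safer choice inside a loop if some matched tags might lack an href.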
You can use PRAW (the Python Reddit API Wrapper) to query Reddit's API, which is much easier than parsing the web pages. Reddit's class names are randomly generated, so you cannot rely on them as selectors.
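A rough sketch of what the PRAW approach could look like is below. It assumes you have installed praw and registered an app with Reddit to get credentials; the client_id, client_secret and user_agent values are placeholders, and search_posts and main are hypothetical helper names introduced for this example only.

```python
def search_posts(reddit, query, limit=10):
    """Return (title, url) pairs for search results across all subreddits.

    `reddit` is expected to be a praw.Reddit instance.
    """
    posts = []
    for submission in reddit.subreddit("all").search(query, limit=limit):
        # submission.permalink is a relative path such as /r/<sub>/comments/<id>/<slug>/
        posts.append((submission.title, "https://www.reddit.com" + submission.permalink))
    return posts


def main():
    # Imported here so the helper above can be used without praw installed.
    import praw

    # Placeholder credentials: replace with the values from your own Reddit app.
    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="search-example by u/your_username",
    )
    for title, url in search_posts(reddit, "test"):
        print(title, url)
```

Call main() once real credentials are filled in. Keeping the search logic in a function that takes the client as a parameter also makes it easy to try with a stand-in object.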