I am trying to scrape a public facebook group using beautifulsoup, I am using the mobile site for the lack of javascript there. So this script supposed to get the link from the 'more' keyword and get the text from p tag there, but it just gets the text from the current page's p tag. Can someone point me the problem? I am new to python and everything in this code.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
import requests
browser = webdriver.Firefox()
browser.get('https://mobile.facebook.com/groups/22012931789?refid=27')
for elem in browser.find_elements_by_link_text('More'):
page = requests.get(elem.get_attribute("href"))
soup=BeautifulSoup(page.content,'html.parser')
print(soup.find_all('p')[0].get_text())
It's always useful to see what your script is actually doing, a quick way of doing this is by printing your results at certain steps along the way.
For example, using your code:
for elem in browser.find_elements_by_link_text('More'):
print("elem's href attribute: {}".format(elem.get_attribute("href")))
You'll notice that the first one's blank. We should test for this before trying to get requests to fetch it:
for elem in browser.find_elements_by_link_text('More'):
if elem.get_attribute("href"):
print("Trying to get {}".format(elem.get_attribute("href")))
page = requests.get(elem.get_attribute("href"))
soup=BeautifulSoup(page.content,'html.parser')
print(soup.find_all('p')[0].get_text())
Note that an empty elem.get_attribute("href")
returns an empty unicode string, u''
- but pythons considers an empty string to be false, which is why that if
works.
Which works fine on my machine. Hope that helps!
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.