I'm a beginner in Python and web scraping in general. In this code I'm using both BeautifulSoup (bs4) and Selenium. I'm using Selenium to click the 'show more' button automatically, so that I can scrape all of the results, not just the ones shown on the first page. I'm trying to scrape the following website: https://boards.euw.leagueoflegends.com/en/search?query=improve
However, since combining bs4 and Selenium, three of the fields that I'm scraping (username, server and topic) give me the following two errors.
1) I get an AttributeError: 'NoneType' object has no attribute 'text' for both server and username:
Traceback (most recent call last):
  File "failoriginale.py", line 153, in <module>
    main()
  File "failoriginale.py", line 132, in main
    song_data = get_songs(index_page)  # Get songs with metadata
  File "failoriginale.py", line 81, in get_songs
    username = row.find(class_='username').text.strip()
AttributeError: 'NoneType' object has no attribute 'text'
2) I get this error with topic:
Traceback (most recent call last):
  File "failoriginale.py", line 153, in <module>
    main()
  File "failoriginale.py", line 132, in main
    song_data = get_songs(index_page)  # Get songs with metadata
  File "failoriginale.py", line 86, in get_songs
    topic = row.find('div', {'class':'discussion-footer byline opaque'}).find_all('a')[1].text.strip()
IndexError: list index out of range
However, before combining bs4 with Selenium, these three fields worked just like the others, so I think the problem is elsewhere. What is going wrong in the main function with song_data? I have already looked at other questions on Stack Overflow but could not solve the problem. I'm new to scraping and to the bs4 and Selenium libraries, so sorry if I'm asking a silly question.
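To narrow it down, here is a minimal, self-contained illustration of the first error: when BeautifulSoup's find() matches nothing it returns None, and calling .text on None raises exactly this AttributeError. The HTML below is a made-up stand-in for the real page (some rows, e.g. promoted ones, may simply lack a username element), and the guard shown is one way to avoid the crash:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second row has no username element
html = """
<div class="discussion-list-item"><span class="username"> Player1 </span></div>
<div class="discussion-list-item promoted"></div>
"""
soup = BeautifulSoup(html, "html.parser")

for row in soup.find_all(class_="discussion-list-item"):
    tag = row.find(class_="username")
    # Guard: only call .text when find() actually matched something
    username = tag.text.strip() if tag else None
    print(username)  # prints "Player1", then "None"
```

The same guard applies to the realm lookup, and checking `len(...)` before indexing `find_all('a')[1]` avoids the IndexError in the same way.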
Here's the code:
browser = webdriver.Firefox(executable_path='./geckodriver')
browser.get('https://boards.euw.leagueoflegends.com/en/search?query=improve&content_type=discussion')
html = browser.page_source  # page_source is where Selenium stores the HTML source

def get_songs(url):
    html = browser.page_source
    index_page = BeautifulSoup(html, 'lxml')  # Parse the page
    items = index_page.find(id='search-results')  # Get the list from the webpage
    if not items:  # If the webpage does not contain the list, we should exit
        print('Something went wrong!', file=sys.stderr)
        sys.exit()
    data = list()
    # "Show more" button: if the page has it, keep clicking it for ~5 seconds
    if index_page.find('a', {"class": "box show-more"}):
        button = browser.find_element_by_class_name('box.show-more')
        timeout = time.time() + 5
        while True:
            button.click()
            time.sleep(5.25)
            if time.time() > timeout:
                break
    html = browser.page_source
    index_page = BeautifulSoup(html, 'lxml')
    items = index_page.find(id='search-results')
    for row in items.find_all(class_='discussion-list-item'):
        username = row.find(class_='username').text.strip()
        question = row.find(class_='title-span').text.strip()
        sentence = row.find('span')['title']
        serverzone = row.find(class_='realm').text.strip()
        # print(serverzone)
        topic = row.find('div', {'class': 'discussion-footer byline opaque'}).find_all('a')[1].text.strip()
        # print(topic)
        date = row.find(class_='timeago').get('title')
        # print(date)
        views = row.find(class_='view-counts byline').find('span', {'class': 'number opaque'}).get('data-short-number')
        comments = row.find(class_='num-comments byline').find('span', {'class': 'number opaque'}).get('data-short-number')
        # Store the data in a dictionary, and add that to our list
        data.append({
            'username': username,
            'topic': topic,
            'question': question,
            'sentence': sentence,
            'server': serverzone,
            'date': date,
            'number_of_comments': comments,
            'number_of_views': views
        })
    return data

def get_song_info(url):
    browser.get(url)
    html2 = browser.page_source
    song_page = BeautifulSoup(html2, features="lxml")
    interesting_html = song_page.find('div', {'class': 'list'})
    if not interesting_html:  # Check if the element was found; not all pages have one
        print('No information available for song at {}'.format(url), file=sys.stderr)
        return {}
    answer = interesting_html.find('span', {'class': 'high-quality markdown'}).find('p').text.strip()
    return {'answer': answer}  # Return the data of interest

def main():
    index_page = BeautifulSoup(html, 'lxml')
    song_data = get_songs(index_page)  # Get songs with metadata
    # For each row in the improve page, enter the link and extract the data
    for row in song_data:
        print('Scraping info on {}.'.format(row['link']))  # Might be useful for debugging
        url = row['link']  # The url is the 'link' column in the csv file
        song_info = get_song_info(url)  # Get lyrics and credits for this song, if available
        for key, value in song_info.items():
            row[key] = value  # Add the new data to our dictionary
    with open('results.csv', 'w', encoding='utf-8') as f:  # Open a csv file for writing
        fieldnames = ['link', 'username', 'topic', 'question', 'sentence', 'server', 'date', 'number_of_comments', 'number_of_views', 'answer']  # These are the values we want to store
Thanks for your help!
I would be tempted to use requests to retrieve the total results count and the number of results per batch, then loop clicking the button with a wait condition until all results are present, and finally grab them in one go, so to speak. Outline below, which can be rewritten as required. You could always use an `n` end point to stop clicking after `n` pages, incrementing `n` within the loop. You might additionally add WebDriverWait(d, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.inline-profile .username'))) at the end, after the last click, to allow time before collecting the other items.
import requests
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Retrieve the total results count and the per-batch count from the JSON endpoint
data = requests.get('https://boards.euw.leagueoflegends.com/en/search?query=improve&json_wrap=1').json()
total = data['searchResultsCount']
batch = data['resultsCount']

d = webdriver.Chrome()
d.get('https://boards.euw.leagueoflegends.com/en/search?query=improve')

# Keep clicking "show more" until all results should be present
counter = batch
while counter < total:
    WebDriverWait(d, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.show-more-label'))).click()
    counter += batch
    # print(counter)

userNames = [item.text for item in d.find_elements_by_css_selector('.inline-profile .username')]
topics = [item.text for item in d.find_elements_by_css_selector('.inline-profile + a')]
servers = [item.text for item in d.find_elements_by_css_selector('.inline-profile .realm')]
results = list(zip(userNames, topics, servers))
Interestingly, the page does seem to stop adding results before the given end count, even though the button can still be clicked. This also happens when clicking manually.
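Since you ultimately want a CSV, here is a short sketch of dumping the zipped results with the stdlib csv module. The rows below are made-up placeholders standing in for the (username, topic, server) tuples collected above:

```python
import csv

# Hypothetical rows in the (username, topic, server) shape produced by zip() above
results = [
    ("Player1", "Gameplay", "EUW"),
    ("Player2", "Help & Support", "EUNE"),
]

with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["username", "topic", "server"])  # header row
    writer.writerows(results)  # one CSV line per tuple
```

Note `newline=""`, which the csv module requires on the file handle to avoid blank lines on Windows.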