
Error when scraping with Beautiful Soup and Selenium together

I'm a beginner in Python and web scraping in general. In this code I'm using both BeautifulSoup (bs4) and Selenium: Selenium automates clicking the 'show more' button, so that I can scrape all of the results rather than just the first page. The site I'm trying to scrape is: https://boards.euw.leagueoflegends.com/en/search?query=improve

However, when I combine bs4 and Selenium, three of the fields I'm scraping (username, server and topic) now raise the following two errors.

1) I get an AttributeError: 'NoneType' object has no attribute 'text' for both server and username:

Traceback (most recent call last):
  File "failoriginale.py", line 153, in <module>
    main()
  File "failoriginale.py", line 132, in main
    song_data = get_songs(index_page) # Get songs with metadata
  File "failoriginale.py", line 81, in get_songs
    username = row.find(class_='username').text.strip()
AttributeError: 'NoneType' object has no attribute 'text'

2) I get this IndexError with topic:

Traceback (most recent call last):
  File "failoriginale.py", line 153, in <module>
    main()
  File "failoriginale.py", line 132, in main
    song_data = get_songs(index_page) # Get songs with metadata
  File "failoriginale.py", line 86, in get_songs
    topic = row.find('div', {'class':'discussion-footer byline opaque'}).find_all('a')[1].text.strip()
IndexError: list index out of range

However, before I combined bs4 with Selenium, these three fields worked just like the others, so I think the problem lies elsewhere. What is wrong with song_data in the main function? I have already looked through other questions on Stack Overflow, but I couldn't solve the problem. I'm new to scraping and to the bs4 and selenium libraries, so sorry if this is a silly question.
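
Those lines fail whenever a matched row is missing one of the child elements: find() returns None (hence the AttributeError on .text), and find_all('a') can return fewer than two anchors (hence the IndexError). A minimal defensive sketch of the same extraction, which skips incomplete rows instead of crashing:

# Sketch: guard each lookup before touching .text, so rows that
# lack these elements are skipped instead of raising.
for row in items.find_all(class_='discussion-list-item'):
    username_tag = row.find(class_='username')
    realm_tag = row.find(class_='realm')
    footer = row.find('div', {'class': 'discussion-footer byline opaque'})
    footer_links = footer.find_all('a') if footer else []
    if not (username_tag and realm_tag and len(footer_links) > 1):
        continue  # incomplete row: not a real search result
    username = username_tag.text.strip()
    serverzone = realm_tag.text.strip()
    topic = footer_links[1].text.strip()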

Here's the code:

import csv
import sys
import time

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Firefox(executable_path='./geckodriver')
browser.get('https://boards.euw.leagueoflegends.com/en/search?query=improve&content_type=discussion')
html = browser.page_source  # page_source is where Selenium stores the HTML source

def get_songs(url):

    html = browser.page_source
    index_page = BeautifulSoup(html,'lxml') # Parse the page

    items = index_page.find(id='search-results') # Get the result list from the page
    if not items: # If the webpage does not contain the list, we should exit
        print('Something went wrong!', file=sys.stderr)
        sys.exit()
    data = list()
    # If the page has the 'show more' button, click it repeatedly for ~5 seconds
    if index_page.find('a', {'class': 'box show-more'}):
        button = browser.find_element_by_class_name('box.show-more')
        timeout = time.time() + 5
        while True:
            button.click()
            time.sleep(5.25)
            if time.time() > timeout:
                break

    html = browser.page_source
    index_page = BeautifulSoup(html,'lxml')
    items = index_page.find(id='search-results')

    for row in items.find_all(class_='discussion-list-item'):

        username = row.find(class_='username').text.strip()
        question = row.find(class_='title-span').text.strip()
        sentence = row.find('span')['title']
        serverzone = row.find(class_='realm').text.strip()
        #print(serverzone)
        topic = row.find('div', {'class':'discussion-footer byline opaque'}).find_all('a')[1].text.strip()
        #print(topic)
        date=row.find(class_='timeago').get('title')
        #print(date)
        views = row.find(class_='view-counts byline').find('span', {'class' : 'number opaque'}).get('data-short-number')
        comments = row.find(class_='num-comments byline').find('span', {'class' : 'number opaque'}).get('data-short-number')

        # Store the data in a dictionary, and add that to our list
        data.append({
                     'username': username,
                     'topic':topic,
                     'question':question,
                     'sentence':sentence,
                     'server':serverzone,
                     'date':date,
                     'number_of_comments':comments,
                     'number_of_views':views
                    })
    return data
def get_song_info(url):
    browser.get(url)
    html2 = browser.page_source
    song_page = BeautifulSoup(html2, features="lxml")
    interesting_html= song_page.find('div', {'class' : 'list'})
    if not interesting_html: # Check whether the div was found; not all pages have one
        print('No information available for song at {}'.format(url), file=sys.stderr)
        return {}
    answer = interesting_html.find('span', {'class' : 'high-quality markdown'}).find('p').text.strip()
    return {'answer': answer} # Return the data of interest



def main():
    index_page = BeautifulSoup(html,'lxml')
    song_data = get_songs(index_page) # Get songs with metadata
    # For each row in the 'improve' page, enter the link and extract the data
    for row in song_data:
        print('Scraping info on {}.'.format(row['link'])) # Might be useful for debugging
        url = row['link'] # The 'link' key holds the URL of the discussion to visit
        song_info = get_song_info(url) # Get lyrics and credits for this song, if available
        for key, value in song_info.items():
            row[key] = value # Add the new data to our dictionary
    with open('results.csv', 'w', encoding='utf-8') as f: # Open a csv file for writing
        fieldnames = ['link','username','topic','question','sentence','server','date','number_of_comments','number_of_views','answer'] # These are the values we want to store
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(song_data)
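
Separately, note that get_songs never stores a 'link' key, so row['link'] in main will raise a KeyError. A hedged sketch of how a link could be collected inside the row loop, assuming (the selector is not confirmed against the page markup) that each row's title sits in the anchor directly after .inline-profile:

from urllib.parse import urljoin

# Hypothetical: grab the href of the title anchor for each row.
# The '.inline-profile + a' selector is an assumption borrowed from
# the answer below, not verified against the boards markup.
link_tag = row.select_one('.inline-profile + a')
link = urljoin('https://boards.euw.leagueoflegends.com', link_tag['href']) if link_tag else None
# ...and add 'link': link to the dictionary appended to data.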

Thanks for your help!

I would be tempted to use requests to retrieve the total result count and the number of results per batch, then loop, clicking the button with a wait condition until all results are present, and finally grab them in one go. An outline is below, which can be rewritten as required. You could also cap the loop at n pages by incrementing a page counter inside it. You might additionally add WebDriverWait(d, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.inline-profile .username'))) just before collecting the items, to allow time for the page to settle after the last click.

import requests
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Ask the search endpoint for its JSON representation to learn how many
# results exist in total and how many arrive per 'show more' batch.
data = requests.get('https://boards.euw.leagueoflegends.com/en/search?query=improve&json_wrap=1').json()
total = data['searchResultsCount']
batch = data['resultsCount']

d = webdriver.Chrome()
d.get('https://boards.euw.leagueoflegends.com/en/search?query=improve')

# Click 'show more' until every remaining batch has been requested.
counter = batch
while counter < total:
    WebDriverWait(d, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.show-more-label'))).click()
    counter += batch

# Collect everything in one go and align the three lists row by row.
userNames = [item.text for item in d.find_elements_by_css_selector('.inline-profile .username')]
topics = [item.text for item in d.find_elements_by_css_selector('.inline-profile + a')]
servers = [item.text for item in d.find_elements_by_css_selector('.inline-profile .realm')]
results = list(zip(userNames, topics, servers))

Interestingly, the page does seem to stop updating before the reported total is reached, even though the button can still be clicked. The same thing happens when clicking manually.
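
If the page stalls like that, one way to keep the loop from spinning forever is to wait after each click for the row count to actually grow, and bail out on a timeout. A sketch, reusing the discussion-list-item class from the question's code (assumed to be the rendered row selector):

from selenium.common.exceptions import TimeoutException

loaded = len(d.find_elements_by_css_selector('.discussion-list-item'))
while loaded < total:
    try:
        WebDriverWait(d, 10).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, '.show-more-label'))).click()
        # Wait until at least one more row has rendered since the click.
        WebDriverWait(d, 10).until(
            lambda drv: len(drv.find_elements_by_css_selector('.discussion-list-item')) > loaded)
    except TimeoutException:
        break  # no new results arrived; stop instead of looping forever
    loaded = len(d.find_elements_by_css_selector('.discussion-list-item'))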
