Not able to scrape information from a website using lxml

I am trying to scrape user beer reviews from beeradvocate.com to analyze user attitudes towards different beer types. But I only get results for the first few pages; the rest are empty.

Situation:

  • There are 500 different types of beer, and each beer has a different number of ratings and reviews
  • The site only shows 1 page of results for guests; to see all the information, you need to log in

My approach:

  • Get the beer link and the number of ratings for each beer, to define the loop range for that beer (a sketch of this step is below)
  • Log in using a requests session and a POST request
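A hypothetical sketch of that first pass; the list URL and the XPath selector are assumptions for illustration, not taken from the original post:

import lxml.html
import requests

# hypothetical first pass: collect each beer's profile link from a list page
list_url = 'https://www.beeradvocate.com/lists/top/'  # assumed starting page
doc = lxml.html.fromstring(requests.get(list_url).text)
doc.make_links_absolute(list_url)  # turn relative hrefs into full URLs
beer_links = doc.xpath('//a[contains(@href, "/beer/profile/")]/@href')
# the number of ratings per beer would be scraped the same way from each profile page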
import time

import lxml.html
import requests

def review_scrape(beer_link, number_of_ratings):
    reviews = []
    rate = []
    for pages_i in range(0, int(number_of_ratings), 25):  # site shows 25 results/page
        session = requests.session()  # start a session and log in for each page
        payload = {'login': 'suzie102', 'password': ''}
        page1 = session.post("https://www.beeradvocate.com/community/login/login", data=payload)
        url = beer_link + '/?view=beer&sort=&start=%d' % (pages_i)
        page1 = session.get(url)
        time.sleep(3)
        soup1 = lxml.html.fromstring(page1.text)
        rate_i = soup1.xpath('//span[@class = "muted"]/text()')[8::3]
        print(url)
        reviews_i = soup1.xpath('//div/text()')
        reviews.append(reviews_i)
        print(len(reviews))
        rate.append(rate_i)
    return rate, reviews

Results:

(screenshot of the scraped output omitted)

There is only one problem that I can see.

 url = beer_link+'/?view=beer&sort=&start=%d'%(pages_i)

The / is redundant; what you need is

 url = beer_link+'?view=beer&sort=&start=%d'%(pages_i)

That is why there is //?view in your printed links.

I can see that there are "next" anchor links leading to the next page. I would recommend a while loop or recursion; a minimal sketch of the while-loop version is below.
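A minimal sketch of that idea, reusing the logged-in requests session from the question. The XPath for the "next" link is an assumption about the page markup, not something verified against the site:

import lxml.html

def scrape_all_pages(session, first_url):
    # follow "next" links until the last page instead of precomputing offsets
    reviews = []
    url = first_url
    while url:
        page = session.get(url)
        doc = lxml.html.fromstring(page.text)
        doc.make_links_absolute(url)  # turn relative hrefs into full URLs
        reviews.extend(doc.xpath('//div/text()'))
        # assumed selector: the pagination anchor whose text is "next"
        next_links = doc.xpath('//a[text()="next"]/@href')
        url = next_links[0] if next_links else None
    return reviews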

Other than that, I can't see what is missing from your script. Everything else looks in order, and it should work.

If you could give us more details, we might have more to work with.

Update: thanks to everyone's comments, I tried scraping with Selenium instead. It works now:

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from tqdm import tqdm

def webstite_scrape_p2(beer_link, number_of_ratings):
    # log in once through the browser so every later page shows all reviews
    driver = webdriver.Chrome('/home/sam/Downloads/chromedriver')
    url = 'https://www.beeradvocate.com/community/login/'
    driver.get(url)

    loginelement = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//form[@class="xenForm formOverlay"]//dd//input[@name ="login"]')))
    loginelement.send_keys('suzie102')

    pwelement = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//form[@class="xenForm formOverlay"]//dl[@class ="ctrlUnit"]//dd//ul//li[@id = "ctrl_pageLogin_registered_Disabler"]//input[@name ="password"]')))
    pwelement.send_keys('')

    page_click = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//form[@class="xenForm formOverlay"]//dl[@class ="ctrlUnit submitUnit"]//dd//input[@type ="submit"]')))
    page_click.click()

    rate = []
    reviews = []
    avg_user = []
    for link, n in zip(beer_link, number_of_ratings):
        for pages_i in tqdm(range(0, int(n), 25)):  # site shows 25 results/page
            new_url = link + '?view=beer&sort=&start=%d' % (pages_i)
            print(new_url)
            driver.get(new_url)
            time.sleep(5)  # give the page time to finish loading
            page_source = driver.page_source

            soup = BeautifulSoup(page_source, 'html.parser')

            rate_i = [el.get_text() for el in soup.find_all('span', class_="muted")][8::3]
            rate.append(rate_i)

            reviews_i = [el.get_text() for el in soup.find_all('div')]
            reviews.append(reviews_i)

            avg_i = [el.get_text() for el in soup.find_all('span', class_="BAscore_norm")]
            avg_user.append(avg_i)

    driver.quit()
    return rate, reviews, avg_user
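For reference, a hypothetical call might look like the following; the profile link and rating count are placeholders, not values from the original post:

# hypothetical inputs: in the full workflow these lists come from the first
# pass that collected each beer's profile link and number of ratings
beer_links = ['https://www.beeradvocate.com/beer/profile/1/2/']  # placeholder link
ratings_counts = ['50']  # placeholder count; int() is applied inside the function

rate, reviews, avg_user = webstite_scrape_p2(beer_links, ratings_counts)
print(len(rate), len(reviews), len(avg_user))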
