I'm trying to parse data from a JSON table on this website.
URL: https://boxes.mysubscriptionaddiction.com/subscription_boxes_for/food
I primarily need the name, rating and description of every food subscription box listed. I'm facing two challenges. First, the table has two views, grid and list. How do I specify which view my code refers to? Second, I am getting this error:
ValueError - Timeout value connect was Timeout(connect=<object object at 0x000002767CECD5C0>,
read=<object object at 0x000002767CECD5C0>, total=None), but it must be an int, float or None.
Not sure what this means.
My code:
from pandas.io.html import read_html
from selenium import webdriver
import json
import requests
import os
import sys
from bs4 import BeautifulSoup
# Raw string so the backslashes in the Windows path are not read as escapes
driver = webdriver.Firefox(executable_path=r'C:\Drivers\geckodriver.exe')
driver.get('https://boxes.mysubscriptionaddiction.com/subscription_boxes_for/food')
# Absolute XPath to the table element
table = driver.find_element_by_xpath('/html/body/div[3]/div/span/div[2]/div/div[1]/div[3]/div[3]/table')
table_html = table.get_attribute('innerHTML')
bs = BeautifulSoup(table_html, 'html.parser')
rows = bs.select('tbody tr')
print(bs)
Here is how to get the data you are looking for (data is a dict that contains the information):
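import requests
from bs4 import BeautifulSoup
import json
scrape_url = 'https://boxes.mysubscriptionaddiction.com/subscription_boxes_for/food'
# Fetch the page with a plain HTTP request; no browser is needed
r1 = requests.get(scrape_url)
page = r1.content
soup = BeautifulSoup(page, 'html.parser')
# The listing data is embedded as JSON inside one of the <script> tags
scripts = soup.find_all('script')
# Index 11 depends on the current page layout and may change
data_str = scripts[11].contents[0].strip()
# strict=False allows control characters inside JSON strings
data = json.loads(data_str, strict=False)
print(data['itemListElement'])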
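Because the position of that script tag can shift when the page changes, it may be safer to look the block up by its content instead of by index. The sketch below is only one possible approach and rests on an assumption not confirmed above: that the data sits in a script tag with type="application/ld+json" (the itemListElement key suggests schema.org JSON-LD, but check the page source). JsonLdCollector and find_item_list are hypothetical helper names, and only the standard library is used so the function can be tried without extra packages.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> tag.

    Assumption: the page embeds its listing data as JSON-LD.
    """

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._buffer))
            self._in_json_ld = False


def find_item_list(html):
    """Return the first JSON-LD object with an 'itemListElement' key, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for block in collector.blocks:
        try:
            data = json.loads(block, strict=False)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if isinstance(data, dict) and "itemListElement" in data:
            return data
    return None
```

With the answer's code, this could be called as find_item_list(r1.text). A None result would mean the assumption about the script type does not hold for this page.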