
Using requests with bs4 and/or json

Here is the source of the page I am looking for: Page Source. If that page source link does not work, here is the link for the source only: "view-source: https://sports.bovada.lv/baseball/mlb"

Here is the Link: Link to page

I am not too familiar with bs4. The script below runs, but it does not return anything I need.

import requests
from bs4 import BeautifulSoup

r = requests.get('https://sports.bovada.lv/baseball/mlb/game-lines-market-group')
soup = BeautifulSoup(r.content, 'lxml')

print(soup.prettify())

I can return the soup just fine. But what I see when inspecting the site and the returned soup are not the same.

Here is a sample of what I can see from inspect: Inspect page

The goal is to extract the team, pitcher, odds, and total runs, which I can clearly see in the inspect version. When I print soup, that information is not included.

Then I dug a little further: at the bottom of the page source I can see an iframe, and below it what looks like a JSON dictionary with everything I am looking to extract. But running a similar script to retrieve the JSON data does not work as I had hoped:

import requests

req = requests.get('view-source:https://sports.bovada.lv//baseball/mlb/game-lines-market-group')
data = req.json()['itemList']
print(data)

I believe I should be using bs4, but I am confused about why the same HTML is not being returned.

The data is dynamic: it is delivered as JSON inside the page source and put into the HTML by JavaScript, so it does not appear as ordinary markup in what requests downloads.

To get at it with BeautifulSoup, find the JavaScript variable in the source that holds the JSON data, then load that string with json, and you can access it from there.

In the link you gave, that variable is var swc_market_lists =

So in the source it will look like:

<script type="text/javascript">var swc_market_lists = {"items":[{"description":"Game Lines","id":"136","link":"/baseball/mlb/game-lines-market-group","baseLink":"/baseball/mlb/game-lines-market-........

You can now use swc_market_lists in a regular expression pattern so that only that script is matched.

Use soup.find with that pattern to return just that script element.
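As a rough illustration of that lookup, here is a minimal sketch that runs against a small inline HTML string instead of the live site. SAMPLE_HTML and find_market_script are hypothetical names made up for this example, and the sample markup is a shortened stand-in for the real page source.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the real page source (an assumption).
SAMPLE_HTML = """
<html><body>
<script type="text/javascript">var other = 1;</script>
<script type="text/javascript">var swc_market_lists = {"items":[{"description":"Game Lines","id":"136"}]};</script>
</body></html>
"""


def find_market_script(html):
    """Return the first <script> whose contents assign swc_market_lists, or None."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(r"swc_market_lists\s*=")
    # string=<compiled regex> keeps only tags whose single string matches the pattern.
    return soup.find("script", string=pattern)


script = find_market_script(SAMPLE_HTML)
print(script.string[:23])
```

soup.find returns None when no script matches, so it is worth checking the result before reading its text.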

Because .text includes the leading "var swc_market_lists = " part, I return the string starting at the JSON itself. In this case that is index 23 (the 24th character), which is the first {.

This means you now have a string of JSON data, which you can load with json.loads and manipulate as required.
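A fixed offset breaks if the variable name changes or if anything (such as a semicolon) follows the closing brace. One possible alternative, shown here only as a sketch, is to locate the first { after the variable name and let json.JSONDecoder.raw_decode stop at the end of the object. The helper name extract_json_after_var and the sample string are assumptions for illustration.

```python
import json


def extract_json_after_var(script_text, var_name):
    """Decode the JSON object assigned to var_name inside a JavaScript snippet."""
    marker = script_text.index(var_name)     # position of the variable name
    start = script_text.index("{", marker)   # first "{" after the name
    # raw_decode parses one JSON value starting at `start` and ignores
    # whatever follows it (for example a trailing ";").
    obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    return obj


# Shortened sample modelled on the excerpt above (hypothetical data).
sample = 'var swc_market_lists = {"items":[{"description":"Game Lines","id":"136"}]};'
data = extract_json_after_var(sample, "swc_market_lists")
print(data["items"][0]["description"])
```

With this sample, sample.index("{") is 23, which matches the offset used in the answer.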

Hopefully you can work with this to find what you want.

from bs4 import BeautifulSoup
import requests
import json
import re
from pprint import pprint


def get_data():
    url = 'https://sports.bovada.lv//baseball/mlb/game-lines-market-group'
    # Send a browser-like User-Agent header with the request.
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36"})
    soup = BeautifulSoup(r.text, 'lxml')

    # res = soup.findAll('script') # find all scripts..

    # Match only the <script> whose contents assign swc_market_lists.
    pattern = re.compile(r"swc_market_lists\s+=\s+(\{.*?\})")
    script = soup.find("script", string=pattern)

    # Skip the 23-character prefix "var swc_market_lists = " so that the
    # returned string starts at the first "{".
    return script.text[23:]


test1 = get_data()

json_data = json.loads(test1)

pprint(json_data['items'])
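Once the JSON is loaded, it is an ordinary Python dictionary. The sketch below walks a hand-written sample that copies only the keys visible in the excerpt above (description, id, link). The real structure is larger and may differ, and summarize_items is a hypothetical helper name.

```python
# Hypothetical sample shaped like the visible part of swc_market_lists.
json_data = {
    "items": [
        {
            "description": "Game Lines",
            "id": "136",
            "link": "/baseball/mlb/game-lines-market-group",
        }
    ]
}


def summarize_items(data):
    """Return (description, link) pairs, tolerating missing keys."""
    return [
        (item.get("description"), item.get("link"))
        for item in data.get("items", [])
    ]


for description, link in summarize_items(json_data):
    print(f"{description}: {link}")
```

Using .get keeps the loop from raising KeyError if an item lacks one of the fields.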
