Here is the source of the page I am looking at: Page Source. If the page source link is not working, here is the view-source link: "view-source: https://sports.bovada.lv/baseball/mlb "
Here is the Link: Link to page
I am not too familiar with using bs4, but here is the script below. It runs, but it does not return anything I need.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://sports.bovada.lv/baseball/mlb/game-lines-market-group')
soup = BeautifulSoup(r.content, 'lxml')
print(soup.prettify())
I can return the soup just fine. But what I see when inspecting the site and what is in the returned soup are not the same.
Here is a sample of what I can see from inspect.
The goal is to extract the team, pitcher, odds and total runs, which I can clearly see in the inspect version. When I print soup, that information is not included.
Then I dug a little further, and at the bottom of the page source I can see an iframe and, below that, what looks like a JSON dictionary with everything I am looking to extract. But running a similar script to retrieve the JSON data does not work like I had hoped:
import requests
req = requests.get('view-source:https://sports.bovada.lv//baseball/mlb/game-lines-market-group')
data = req.json()['itemList']
print(data)
I believe I should be using bs4, but I am confused about why the same HTML is not being returned.
The page is dynamic: the data is held as JSON and JavaScript puts it into the HTML, so the markup you see in the inspector is not in the response that requests receives.
To access it with BS you need to find the var in the source that contains the JSON data, then load that as JSON, and you can access everything from there.
For the link you gave, the variable is var swc_market_lists =
So in the source it will look like
<script type="text/javascript">var swc_market_lists = {"items":[{"description":"Game Lines","id":"136","link":"/baseball/mlb/game-lines-market-group","baseLink":"/baseball/mlb/game-lines-market-........
Now you can use swc_market_lists in the regular expression pattern so that only that script is matched.
Use soup.find to return just that section.
Because the .text will include the var part, I have returned the data from the start of the JSON string. In this case that is from character 24 (index 23 in the slice), which is the first {
This means you now have a string of JSON data which you can then load as json and manipulate as required.
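As a small illustration of the slicing step, here is a minimal sketch that finds the first { instead of hard-coding the offset. The script_text value is a hypothetical, shortened stand-in for the real script contents, which are much larger:

```python
import json

# Hypothetical, shortened stand-in for the text of the matching <script> tag.
script_text = 'var swc_market_lists = {"items": [{"description": "Game Lines", "id": "136"}]}'

# Locate the first "{" rather than relying on a fixed offset.
start = script_text.index("{")

# Everything from that point on is the JSON string.
json_string = script_text[start:]
data = json.loads(json_string)

print(start)
print(data["items"][0]["description"])
```

Looking up the brace position keeps the slice correct if the variable name or the spacing around the = ever changes.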
Hopefully you can work with this to find what you want
from bs4 import BeautifulSoup as bs4
import requests
import json
from pprint import pprint
import re

def get_data():
    url = 'https://sports.bovada.lv//baseball/mlb/game-lines-market-group'
    # Send a browser-like User-Agent with the request.
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36"})
    html_bytes = r.text
    soup = bs4(html_bytes, 'lxml')
    # res = soup.findAll('script') # find all scripts..
    # Match only the script whose text assigns swc_market_lists.
    pattern = re.compile(r"swc_market_lists\s+=\s+(\{.*?\})")
    script = soup.find("script", text=pattern)
    # Skip the leading 'var swc_market_lists = ' (23 characters) so the string starts at the first {.
    return script.text[23:]

test1 = get_data()
json_data = json.loads(test1)
pprint(json_data['items'])
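If the script text has anything after the closing brace (a trailing semicolon, for example), json.loads on the sliced string would fail. A possible variation, sketched below, uses json.JSONDecoder().raw_decode, which parses one JSON value starting at a given index and ignores whatever follows it. The helper name extract_js_object and the sample HTML are assumptions made for illustration; they are not taken from the real page:

```python
import json
import re


def extract_js_object(page_html, var_name):
    """Return the JSON object assigned to `var <var_name> = ...`, or None if absent."""
    # Find the assignment; the match ends right where the JSON value begins.
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", page_html)
    if match is None:
        return None
    # raw_decode parses a single JSON value starting at the given index
    # and returns it along with the index where parsing stopped.
    obj, _end = json.JSONDecoder().raw_decode(page_html, match.end())
    return obj


# Hypothetical, shortened sample standing in for the downloaded page source.
sample_html = (
    '<html><body><script type="text/javascript">'
    'var swc_market_lists = {"items": [{"description": "Game Lines", "id": "136"}]};'
    '</script></body></html>'
)

market_lists = extract_js_object(sample_html, "swc_market_lists")
print(market_lists["items"])
```

With the real page you would pass r.text instead of sample_html. This version works on the raw HTML string, so it does not need soup.find for this particular step.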