
Stuck with HTML scraping using BeautifulSoup (Python)


I want to convert activities uploaded to Strava to a .gpx file.

To do this I need to scrape the Strava activity HTML page for the elevation, longitude, latitude, etc. This is stored within the <div data-react-class= line. I have included an extract of the website code below. I only care about the information from {"activity":{"name" onwards.

       </li>
      </ul>
     </div>
    </nav>
   </header>
   <div data-react-class="ActivityPublic" data-react-props='{
  "activity": {
    "name": "Morning Ride",
    "date": "Today",
    "athlete": {
      "name": "James Whyard",
      "avatarUrl": "https://lh3.googleusercontent.com/a-/AOh14GiA8yxgfozOqSJEiwW9srS-VEZU_mV_UM2iHFZxjw=s96-c",
      "location": "",
      "followersCount": 3,
      "followAthleteUrl": "http://www.strava.com/register?activity_action=athlete\u0026activity_id=7487240518\u0026athlete_id=90220142\u0026content=90220142\u0026cta=follow\u0026element=button\u0026follow_athlete_after_login=true\u0026follow_athlete_after_registration=true\u0026follow_athlete_id=90220142\u0026source=activities_show",
      "totalDistance": "452",
      "distanceUnit": "miles",
      "totalActivities": 40
    },
    "type": "Ride",
    "detailedType": "Ride",
    "kudosCount": 0,
    "comments": [],
    "commentCount": 0,
    "achievementsCount": 11,
    "distance": "11.7 mi",
    "time": "49:38",
    "elevation": "246 ft",
    "calories": 526.0,
    "streams": {
      "altitude": [6.6, 6.6, 6.6, 6.7, 6.7, 6.7, 6.7, 6.7, 6.7, 6.9, 6.7, 6.6, 6.5, 6.4, 6.4, 6.4, 6.4, 6.2, 5.9, 6.0, 5.9, 5.8, 5.7, 5.6, 5.6, 5.6, 5.7, 5.9, 6.0, 6.0, 5.9, 5.9, 5.9, 6.0, 6.0, 6.0, 6.0, 6.0, 6.1, 6.2, 6.2, 6.4, 6.5, 6.5, 6.6, 6.9, 7.2, 7.2, 7.4

You can use .get on the element to read an attribute's value, that is:

import requests
from bs4 import BeautifulSoup

url = 'https://www.strava.com/activities/7487240518'
urlr = requests.get(url)

soup = BeautifulSoup(urlr.content, 'html.parser')

divdata = soup.find('div', {'data-react-class':'ActivityPublic'})
strdata = divdata.get('data-react-props')
print(strdata)

You're very close with this!

What I would do is grab the div element as you are doing, then get the data-react-props attribute, which contains all the data you're looking for. It is clearly formatted as JSON, so we can parse it as such and get all the information we need from there.

import json

import requests
from bs4 import BeautifulSoup

url = 'https://www.strava.com/activities/7487240518'
urlr = requests.get(url)

soup = BeautifulSoup(urlr.content, 'html.parser')

# Locate the React root element that carries the activity data.
divdata = soup.find('div', {'data-react-class':'ActivityPublic'})
# The attribute value is a JSON string; decode it into a dict.
activity_data = divdata.get("data-react-props")
activity_dict = json.loads(activity_data)

print("My ride's elevation was:", activity_dict['activity']['elevation'])
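
Once the attribute has been decoded, the nested streams can be read like any other dict. Here is a minimal sketch of that step, using a trimmed inline sample modeled on the excerpt in the question rather than a live request. The helper name altitude_summary is made up for illustration, and only the keys visible in the excerpt (activity, streams, altitude) are assumed to exist.

```python
import json

# Trimmed sample modeled on the data-react-props excerpt in the question.
SAMPLE_PROPS = '''
{
  "activity": {
    "name": "Morning Ride",
    "elevation": "246 ft",
    "streams": {"altitude": [6.6, 6.6, 6.7, 5.6, 7.4]}
  }
}
'''


def altitude_summary(props_text):
    """Return (count, lowest, highest) for the altitude stream, or None if absent.

    Hypothetical helper: it only relies on the keys shown in the question.
    """
    activity = json.loads(props_text).get("activity") or {}
    altitude = (activity.get("streams") or {}).get("altitude")
    if not altitude:
        return None
    return len(altitude), min(altitude), max(altitude)


print(altitude_summary(SAMPLE_PROPS))
```

Using .get with a fallback instead of direct indexing means a page that lacks the streams key yields None rather than raising a KeyError.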

Edit: @It_is_Chris suggested using the Strava API instead (https://developers.strava.com/docs/reference/). This seems like a better alternative.
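
Whichever source the data comes from, the end goal in the question is a .gpx file. As a rough sketch of that last step, the function below writes (latitude, longitude, elevation) tuples as a GPX 1.1 track using only the standard library. The name build_gpx, the creator string, and the sample coordinates are placeholders invented for illustration. The excerpt in the question shows only the altitude stream, so where the coordinates live in the scraped data is not confirmed here.

```python
import xml.etree.ElementTree as ET

GPX_NS = "http://www.topografix.com/GPX/1/1"


def build_gpx(name, points):
    """Build a GPX 1.1 document string from (lat, lon, ele) tuples.

    Hypothetical helper, not part of any Strava or BeautifulSoup API.
    """
    gpx = ET.Element(
        "gpx",
        {"version": "1.1", "creator": "example-script", "xmlns": GPX_NS},
    )
    trk = ET.SubElement(gpx, "trk")
    ET.SubElement(trk, "name").text = name
    seg = ET.SubElement(trk, "trkseg")
    for lat, lon, ele in points:
        # Each track point carries its position as attributes and its
        # elevation (in metres, per the GPX schema) as a child element.
        pt = ET.SubElement(seg, "trkpt", {"lat": str(lat), "lon": str(lon)})
        ET.SubElement(pt, "ele").text = str(ele)
    return ET.tostring(gpx, encoding="unicode")


# Placeholder coordinates, not taken from the activity in the question.
sample_points = [(51.5, -0.12, 6.6), (51.5001, -0.1201, 6.7)]
gpx_text = build_gpx("Morning Ride", sample_points)
print(gpx_text)
```

The returned string can be written to a file with a .gpx extension. Timestamps are left out because the excerpt does not show a time stream.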

