
Webscraping coordinates of a polygon with Python and BeautifulSoup

I'm trying to scrape information from this webpage and many similar ones: https://knowyourcity.info/settlement/1846/5119249

When viewing the page source, the coordinates of the polygon at the top of the page are available, but they are not visible when inspecting the polygon element. Does anyone know how to scrape these coordinates into a column of a dataframe using the BeautifulSoup package in Python?

This is the code I used to access the website:

from requests import get
from bs4 import BeautifulSoup

url = 'http://knowyourcity.info/settlement/1846/5119249'
response = get(url)
print(response.text[:500])  # preview the first 500 characters of the HTML

html_soup = BeautifulSoup(response.text, "html.parser")
print(type(html_soup))  # confirm the page was parsed into a BeautifulSoup object

It looks like the map is driven by the JavaScript variable settlement. One option is therefore to loop through all the script tags and search for var settlement. Once you've found the variable, use a simple find and string slicing to extract the variable's data. Parse this as JSON and then return the boundaries.

The example is for illustration purposes. You'll most likely want to refactor the code:

from requests import get
from bs4 import BeautifulSoup
import json

def getHtml():
    url = 'http://knowyourcity.info/settlement/1846/5119249'
    response = get(url)
    return response.text

def extractBoundaries(html):
    html_soup = BeautifulSoup(html, "html.parser")
    scripts = html_soup.find_all('script')

    # Markers around the object literal assigned to `settlement`
    startFind = "var settlement = "
    endFind = "};"

    for script in scripts:
        if script.contents and startFind in script.contents[0]:
            scriptText = script.contents[0]
            startIndex = scriptText.find(startFind) + len(startFind)
            # Look for the end marker only after the start of the data;
            # keep the closing brace and drop the trailing semicolon
            endIndex = scriptText.find(endFind, startIndex) + 1
            settlementData = scriptText[startIndex:endIndex]
            jsonData = json.loads(settlementData)
            return jsonData['verification/A0_Boundary']

    return None  # no script tag defines `settlement`

html = getHtml()
results = extractBoundaries(html)
print(results)

Output:

5.599769999885609 -0.224459999729163 0 0;5.599920830581937 -0.2235293057328249 0 0;5.600343984087658 -0.2220772405721618 0 0;5.600582171330188 -0.2212706242398781 0 0;5.600757735181389 -0.2203650797845285 0 0;5.600943331869303 -0.2195227513738018 0 0;5.601229999764712 -0.2178069995933356 0 0;5.601684627743396 -0.2160719483616731 0 0;5.602178000314495 -0.215115999603654 0 0;5.60277082980997 -0.213977987593978 0 0;5.60322584449716 -0.2131045282513355 0 0;5.603939996133988 -0.2117290691411995 0 0;5.604261867990886 -0.2111080629866819 0 0;5.604746000027944 -0.210174000129939 0 0;5.605512212518647 -0.208745954062465 0 0;5.605957084651777 -0.2079168151088879 0 0;5.60642700020594 -0.2070410004417909 0 0;5.606837000227415 -0.2063009995914058 0 0;5.607503034537444 -0.2072989224072899 0 0;5.608332999968013 -0.2085879998362543 0 0;5.608940827457275 -0.2094694811315776 0 0;5.609384837140567 -0.2101133921192968 0 0;5.609949999892649 -0.210933000057878 0 0;5.610520744736618 -0.2114266172445696 0 0;5.61105999981919 -0.2118930002616821 0 0;5.612419000436546 -0.2126160003281257 0 0;5.613144659798252 -0.2126897915006225 0 0;5.614907000058054 -0.2128690003040674 0 0;5.615398000217567 -0.2144450001366067 0 0;5.615173904452149 -0.2159211302559356 0 0;5.614935501372315 -0.2174915048290131 0 0;5.61470415976919 -0.2190153628686744 0 0;5.614495076386731 -0.2203926071330784 0 0;5.61425499966856 -0.2219740001999071 0 0;5.613865981729703 -0.2233052558328268 0 0;5.613273865396593 -0.2253315354219581 0 0;5.612689000297166 -0.227333000017893 0 0;5.611838309990048 -0.2274067552175438 0 0;5.611219650166788 -0.2272163984180224 0 0;5.610458222968646 -0.2271212195685735 0 0;5.609547010985807 -0.2272079061199293 0 0;5.608730734136145 -0.2266937097468826 0 0;5.607481517358167 -0.2262178181977106 0 0;5.605377060602905 -0.2259990644052436 0 0;5.603420000032998 -0.2258499999774699 0 0;5.602499999875136 -0.2257000002547329 0 0;5.601491149397077 -0.225320574484897 0 0;5.599769999885609 -0.224459999729163 0 0
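
Since the question asks for the coordinates in a dataframe, the string above still needs to be split into numbers. The following is a minimal sketch of that step, not part of the original answer. It assumes each vertex is separated by a semicolon and that the first two values of each vertex are latitude and longitude (the values are consistent with that, but the page does not confirm it). The trailing two values are ignored. The names parse_boundary and sample are hypothetical.

```python
def parse_boundary(boundary):
    """Split 'a b c d;a b c d;...' into a list of (first, second) float pairs.

    Assumption: the first two fields of each vertex are latitude and
    longitude; any further fields are ignored.
    """
    points = []
    for vertex in boundary.split(";"):
        fields = vertex.split()
        if len(fields) < 2:
            continue  # skip empty or malformed vertices
        points.append((float(fields[0]), float(fields[1])))
    return points


# Hypothetical sample: the first two vertices from the output above
sample = ("5.599769999885609 -0.224459999729163 0 0;"
          "5.599920830581937 -0.2235293057328249 0 0")
points = parse_boundary(sample)
print(points)
```

If pandas is installed, pandas.DataFrame(points, columns=["lat", "lon"]) turns this list into a two-column dataframe with one row per vertex. To keep one row per settlement instead, the whole list (or the raw string) can be stored as a single cell value.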
