Best way to scrape a list of data from a website with Python

I'm scraping data from a web page for use in an API and am looking for the most Pythonic way to do it. The page source contains a list of dictionaries named 'markerData', and I need to grab the lat and lng values from each entry.

Data Sample:

"markerData": [{"docEl":null,"lid":0,"clickable":true,"lat":34.0489281,"lng":-111.0937311,"title":"","iconURL":"//assets.bankofamerica.com/images/mapmarker2.png","info":"</div>View all locations in Arizona</a></div></div></div></div></div>"}, {"docEl":null,"lid":1,"clickable":true,"lat":35.20105,"lng":-91.8318334,"title":"","iconURL":"//assets.bankofamerica.com/images/mapmarker2.png","info":"</div>View all locations in Arkansas</a></div></div></div></div></div>"},

I've used Python's lxml module a few times for this kind of task, but since 'markerData' isn't part of an obvious HTML structure, I'm not sure how best to proceed. Specifically, in the function below I'm stuck on defining the tree.xpath expression for the lat and lng values.

import requests
from lxml import html

lats = []
lngs = []

def get_coordinates():
    i = 0
    while i < 35:
        page = requests.get('https://locators.bankofamerica.com/&check_list=4429#')
        tree = html.fromstring(page.content)

        # Stuck here: I don't know what XPath selects the lat and lng values
        lat = tree.xpath('//div[@id=mapWrap/markerData/lat/text()'.format(i))
        lng = tree.xpath('//div[@id=mapWrap/markerData/lng/text()'.format(i))

        str1 = ''.join(lat)
        str2 = ''.join(lng)

        lats.append(str1)
        lngs.append(str2)

        i += 1

    return lats, lngs

I also can't shake the feeling that there might be an easier way to do this, such as reading the entire page source as text and extracting just the 'markerData' list.

I would appreciate any help in defining an XPath for my lat and lng values, or any alternative ideas on how to isolate and capture this data.

Here's the function I wrote that got the job done for me in case it might help someone else in a similar situation:

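import re

import requests
from lxml import html

def get_coordinates():
    page = requests.get('https://locators.bankofamerica.com/&check_list=4429')
    tree = html.fromstring(page.content)

    # markerData sits inside a <script> block, so select that script's text
    # instead of trying to address it as HTML elements
    scripts = tree.xpath("//script[contains(., 'markerData')]/text()")
    script_text = ''.join(scripts)

    # Raw strings keep \d and \. intact; the capture group returns only the number
    la = re.findall(r'"lat":(-?\d+\.\d+)', script_text)
    lo = re.findall(r'"lng":(-?\d+\.\d+)', script_text)

    # Maps latitude string -> longitude string; entries that share a latitude
    # would overwrite each other
    coords = dict(zip(la, lo))

    return coords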
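
Since the markerData sample looks like a JSON array, another option might be to find that array in the script text and hand it to the json module instead of using regular expressions. The following is only a sketch and has not been run against the live page: the helper name extract_marker_data and the sample string are made up for illustration (the sample reuses the two coordinate pairs from the question), and it assumes the array on the real page is valid JSON.

```python
import json


def extract_marker_data(page_source):
    """Return the list that follows the "markerData" key in page_source."""
    key = '"markerData"'
    key_pos = page_source.find(key)
    if key_pos == -1:
        raise ValueError('markerData not found in page source')

    # The array starts at the first '[' after the key
    start = page_source.find('[', key_pos + len(key))
    if start == -1:
        raise ValueError('no list found after markerData')

    # raw_decode parses one JSON value starting at `start` and ignores
    # whatever JavaScript follows the closing bracket
    markers, _end = json.JSONDecoder().raw_decode(page_source, start)
    return markers


# Hypothetical stand-in for the script text, trimmed to a few fields
sample = (
    'var cfg = {"markerData": ['
    '{"lid":0,"lat":34.0489281,"lng":-111.0937311,"title":""}, '
    '{"lid":1,"lat":35.20105,"lng":-91.8318334,"title":""}'
    '], "zoom": 4};'
)

# A list of (lat, lng) float pairs keeps every marker, even repeated latitudes
coords = [(m['lat'], m['lng']) for m in extract_marker_data(sample)]
print(coords)
```

On the real page, the string passed in would be the joined script text from the XPath query (or page.text), and the values come back as floats rather than strings.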

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address.