
Python: retrieving co-ordinates from html code

I'm currently using basic image mapping (for shelf tests, if anyone is into market research). I upload an image to an image-mapping app and then manually draw boxes around particular products, which generates co-ordinates. I have an Excel sheet with a macro that, once these co-ordinates are pasted in, creates code that I can put directly into my surveys. Initially, the co-ordinates are embedded in HTML code like the following:

    <area  alt="" title="" href="http://www.image-maps.com/" shape="rect" 
    coords="10,32,202,115" style="outline:none;" target="_self"     />
    <area  alt="" title="" href="http://www.image-maps.com/" shape="rect" 
    coords="206,25,346,124" style="outline:none;" target="_self"     />
    <area  alt="" title="" href="http://www.image-maps.com/" shape="rect" 
    coords="340,27,448,121" style="outline:none;" target="_self"     />
    <area  alt="" title="" href="http://www.image-maps.com/" shape="rect" 
    coords="446,2,559,119" style="outline:none;" target="_self"     />
    <area shape="rect" coords="998,567,1000,569" alt="Image Map" style="outline:none;" 
    title="Image Map"/>

These can get very long. I was just wondering how I could extract just the co-ordinates, with each set printed on a new line. All I have currently is the following, which simply returns all the numbers and only works if the pasted HTML code is all on one line:

    import re
    html = '#paste html block onto one line'
    coords = re.findall(r"\d+", html)
    print(coords)

My ideal output here would be

    10,32,202,115
    206,25,346,124
    340,27,448,121
    446,2,559,119
    998,567,1000,569

Any suggestions?

The Beautiful Soup package for Python can do this very easily.

You can install it with pip install beautifulsoup4

Using the HTML you provided (I saved it in a file called "stuff.xml"), the following code will get the co-ordinates for you:

    from bs4 import BeautifulSoup

    # Parse the saved markup with Python's built-in HTML parser
    with open("stuff.xml") as xml_file:
        soup = BeautifulSoup(xml_file.read(), "html.parser")

    # Read the coords attribute of every <area> tag
    coords = [area["coords"] for area in soup.find_all('area')]

    print(coords)

    # -> ['10,32,202,115', '206,25,346,124', '340,27,448,121', '446,2,559,119', '998,567,1000,569']

Using regular expressions on HTML or XML is going to cause problems sooner rather than later; it is better to use an actual parser.
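If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library's html.parser module. This is only a minimal sketch: the class name CoordsCollector, the function extract_coords and the shortened sample string are made up for illustration and are not part of the original answer.

```python
from html.parser import HTMLParser


class CoordsCollector(HTMLParser):
    """Collect the coords attribute of every <area> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.coords = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names are lowercased by the parser.
        if tag == "area":
            value = dict(attrs).get("coords")
            if value:
                self.coords.append(value)


def extract_coords(html):
    """Return the coords strings found in html, in document order."""
    parser = CoordsCollector()
    parser.feed(html)
    parser.close()
    return parser.coords


# Shortened sample based on the markup in the question; tags may span lines.
sample = """
<area  alt="" title="" href="http://www.image-maps.com/" shape="rect"
coords="10,32,202,115" style="outline:none;" target="_self"     />
<area shape="rect" coords="998,567,1000,569" alt="Image Map" style="outline:none;"
title="Image Map"/>
"""

# One set of co-ordinates per line
for line in extract_coords(sample):
    print(line)
```

Because the parser works on tags rather than lines, it does not matter whether the pasted block is on one line or many.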

    import re

    html = '#paste html block onto one line'
    # Match four comma-separated integers, e.g. 10,32,202,115
    coords = re.findall(r'\d+,\d+,\d+,\d+', html)
    for num in coords:
        print(num)

Output for your HTML code above:

    10,32,202,115
    206,25,346,124
    340,27,448,121
    446,2,559,119
    998,567,1000,569
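This pattern does not actually need the HTML to be on one line: a triple-quoted string can hold the pasted block as it is. Here is a small sketch of that, which also turns each match into a tuple of integers in case the numbers are needed later. The shortened html sample and the names pattern and boxes are just for illustration; for shape="rect" the four values are the left, top, right and bottom edges.

```python
import re

# Triple quotes let the pasted block keep its line breaks.
html = """
<area shape="rect"
coords="10,32,202,115" />
<area shape="rect" coords="206,25,346,124" />
"""

# Raw string so that \d reaches the regex engine unchanged.
pattern = re.compile(r"\d+,\d+,\d+,\d+")

# Split each matched string on commas and convert the parts to int.
boxes = [tuple(int(n) for n in match.split(",")) for match in pattern.findall(html)]

for left, top, right, bottom in boxes:
    print(left, top, right, bottom)
```

Note that this still matches any run of four comma-separated integers, wherever it appears in the text, so it relies on the markup containing nothing else of that shape.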

I hope I have understood you correctly:

    import re

    test_str = ("   <area  alt=\"\" title=\"\" href=\"http://www.image-maps.com/\" shape=\"rect\" \n"
        "    coords = \"10,32,202,115\" style=\"outline:none;\" target=\"_self\"     />\n"
        "    <area  alt=\"\" title=\"\" href=\"http://www.image-maps.com/\" shape=\"rect\" \n"
        "    coords=\"206,25,346,124\" style=\"outline:none;\" target=\"_self\"     />\n"
        "    <area  alt=\"\" title=\"\" href=\"http://www.image-maps.com/\" shape=\"rect\" \n"
        "    coords=\"340,27,448,121\" style=\"outline:none;\" target=\"_self\"     />\n"
        "    <area  alt=\"\" title=\"\" href=\"http://www.image-maps.com/\" shape=\"rect\" \n"
        "    coords=\"446,2,559,119\" style=\"outline:none;\" target=\"_self\"     />\n"
        "    <area shape=\"rect\" coords=\"998,567,1000,569\" alt=\"Image Map\" style=\"outline:none;\" \n"
        "    title=\"Image Map\"/>")

    # Capture whatever sits inside the first pair of quotes after "coords"
    regex = r"coords.+?\"(.+?)\""

    coords = re.findall(regex, test_str, re.MULTILINE | re.IGNORECASE)

    print(coords)
    # Output: ['10,32,202,115', '206,25,346,124', '340,27,448,121', '446,2,559,119', '998,567,1000,569']
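A pattern this loose can misfire, for example on an empty coords="" attribute, where the capture group would run on into the next attribute. As a possible tightening (a sketch only, not what the answer above uses), the pattern can be tied to the attribute syntax itself. The names COORDS_RE and find_coords and the sample string are hypothetical.

```python
import re

# "coords", optional spaces around "=", then a quoted value.
# The backreference \1 makes the closing quote match the opening one.
COORDS_RE = re.compile(r"""\bcoords\s*=\s*(["'])(.*?)\1""", re.IGNORECASE)


def find_coords(html):
    """Return every non-empty coords attribute value found in html."""
    return [m.group(2) for m in COORDS_RE.finditer(html) if m.group(2)]


sample = (
    '<area shape="rect" coords = "10,32,202,115" />\n'
    "<area shape='rect' COORDS='206,25,346,124' />\n"
    '<area shape="default" coords="" />'
)

# One set of co-ordinates per line
print("\n".join(find_coords(sample)))
```

This is still a regular expression over markup, so it will not cope with every valid way of writing an attribute; it just narrows what counts as a match.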
