Python: retrieving co-ordinates from html code

Question

I'm currently using basic image mapping (for shelf tests, if anyone is into market research). I upload an image to an image-mapping app and then manually draw boxes around particular products, which generates co-ordinates. I have an Excel sheet with a macro that, once these co-ordinates are pasted in, creates code that I can put directly into my surveys. Initially, the co-ordinates are embedded in HTML code like the following:

    <area  alt="" title="" href="http://www.image-maps.com/" shape="rect" 
    coords="10,32,202,115" style="outline:none;" target="_self"     />
    <area  alt="" title="" href="http://www.image-maps.com/" shape="rect" 
    coords="206,25,346,124" style="outline:none;" target="_self"     />
    <area  alt="" title="" href="http://www.image-maps.com/" shape="rect" 
    coords="340,27,448,121" style="outline:none;" target="_self"     />
    <area  alt="" title="" href="http://www.image-maps.com/" shape="rect" 
    coords="446,2,559,119" style="outline:none;" target="_self"     />
    <area shape="rect" coords="998,567,1000,569" alt="Image Map" style="outline:none;" 
    title="Image Map"/>

These can get very long. I was wondering how I could extract just the co-ordinates, with each set printed on a new line. All I have currently is the following, which simply returns all the numbers and only works if the pasted HTML code is all on one line:

    import re
    html = '#paste html block onto one line'
    coords = re.findall(r"\d+", html)
    print(coords)

My ideal output here would be

10,32,202,115

206,25,346,124

340,27,448,121

446,2,559,119

998,567,1000,569

Any suggestions?

Answer 1

The Beautiful Soup package for Python can do this very easily.

You can install it with pip install beautifulsoup4.

Using the HTML you provided (I saved it in a file called "stuff.xml"), the following code will get the co-ordinates for you:

    from bs4 import BeautifulSoup

    with open("stuff.xml") as xml_file:
        soup = BeautifulSoup(xml_file.read(), "html.parser")

    coords = [area["coords"] for area in soup.find_all('area')]

    print(coords)

    # -> ['10,32,202,115', '206,25,346,124', '340,27,448,121', '446,2,559,119', '998,567,1000,569']

Using regular expressions on HTML is likely to cause problems sooner rather than later; it is better to use an actual HTML parser.
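
If installing a third-party package is not an option, a similar result can probably be had with the standard library's html.parser module alone. The sketch below is only an illustration: the names AreaCoordsParser and extract_coords are made up for it, and the sample is a shortened copy of the HTML from the question. It prints one set of co-ordinates per line, as the question asks.

```python
from html.parser import HTMLParser


class AreaCoordsParser(HTMLParser):
    """Collects the coords attribute of every <area> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.coords = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; self-closing tags such as
        # <area ... /> are also routed here by the default handle_startendtag.
        if tag == "area":
            value = dict(attrs).get("coords")
            if value is not None:
                self.coords.append(value)


def extract_coords(html):
    """Return the coords strings of all <area> tags, in document order."""
    parser = AreaCoordsParser()
    parser.feed(html)
    parser.close()
    return parser.coords


# Shortened copy of the question's HTML; tags may span several lines.
sample = """
<area  alt="" title="" href="http://www.image-maps.com/" shape="rect"
coords="10,32,202,115" style="outline:none;" target="_self"     />
<area shape="rect" coords="998,567,1000,569" alt="Image Map" style="outline:none;"
title="Image Map"/>
"""

print("\n".join(extract_coords(sample)))
```

Because the parser reads attributes by name, the order of attributes and the line breaks inside a tag should not matter.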

Answer 2

    import re
    html = '#paste html block onto one line'
    coords = re.findall(r'\d+,\d+,\d+,\d+', html)
    for num in coords:
        print(num)

Output for the HTML in your question:

10,32,202,115
206,25,346,124
340,27,448,121
446,2,559,119
998,567,1000,569

I hope I've understood the question correctly.
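
re.findall scans the whole string, line breaks included, so the HTML does not have to be squashed onto one line first; pasting it into a triple-quoted string should be enough. A small sketch, using a shortened copy of the HTML from the question:

```python
import re

# Triple quotes let the HTML be pasted exactly as it is, over several lines.
html = """
<area  alt="" title="" href="http://www.image-maps.com/" shape="rect"
coords="10,32,202,115" style="outline:none;" target="_self"     />
<area  alt="" title="" href="http://www.image-maps.com/" shape="rect"
coords="206,25,346,124" style="outline:none;" target="_self"     />
"""

# Four comma-separated integers: the shape of a rect's coords value.
coords = re.findall(r"\d+,\d+,\d+,\d+", html)
print("\n".join(coords))
```

Note that this pattern assumes every area is a rectangle with exactly four numbers; circles and polygons carry a different number of values and would need a different pattern.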

Answer 3

    import re

    # HTML from the question, kept as one multi-line string
    test_str = ("   <area  alt=\"\" title=\"\" href=\"http://www.image-maps.com/\" shape=\"rect\" \n"
        "    coords = \"10,32,202,115\" style=\"outline:none;\" target=\"_self\"     />\n"
        "    <area  alt=\"\" title=\"\" href=\"http://www.image-maps.com/\" shape=\"rect\" \n"
        "    coords=\"206,25,346,124\" style=\"outline:none;\" target=\"_self\"     />\n"
        "    <area  alt=\"\" title=\"\" href=\"http://www.image-maps.com/\" shape=\"rect\" \n"
        "    coords=\"340,27,448,121\" style=\"outline:none;\" target=\"_self\"     />\n"
        "    <area  alt=\"\" title=\"\" href=\"http://www.image-maps.com/\" shape=\"rect\" \n"
        "    coords=\"446,2,559,119\" style=\"outline:none;\" target=\"_self\"     />\n"
        "    <area shape=\"rect\" coords=\"998,567,1000,569\" alt=\"Image Map\" style=\"outline:none;\" \n"
        "    title=\"Image Map\"/>")

    # Lazily capture the quoted value that follows "coords"
    regex = r"coords.+?\"(.+?)\""

    coords = re.findall(regex, test_str, re.MULTILINE | re.IGNORECASE)

    print(coords)
    # Output: ['10,32,202,115', '206,25,346,124', '340,27,448,121', '446,2,559,119', '998,567,1000,569']
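
The `.+?` between `coords` and the opening quote accepts any characters, so the word "coords" appearing inside another attribute's value can produce a stray match. A slightly stricter variant, offered here as an assumption rather than the answerer's method, allows only optional whitespace around `=` before the quote. The tag below is a made-up example, not taken from the question:

```python
import re

# Hypothetical tag whose alt text happens to contain the word "coords".
html = '<area alt="coords" title="x" coords="1,2,3,4" />'

loose = r'coords.+?"(.+?)"'          # same pattern as above, written without escapes
strict = r'coords\s*=\s*"([^"]*)"'   # only whitespace and "=" allowed before the quote

loose_result = re.findall(loose, html, re.IGNORECASE)
strict_result = re.findall(strict, html, re.IGNORECASE)

print(loose_result)   # ['x', '1,2,3,4']
print(strict_result)  # ['1,2,3,4']
```

For the HTML in the question both patterns return the same five values; the difference only shows up on input like the contrived tag here.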

3 answers

solution1
1 ACCEPTED 2022-02-16 00:53:19

solution2
0 2022-02-16 00:36:40

solution3
0 2022-02-16 01:15:40
