
Specific xPath and Regex - Web Crawling

I'm trying to scrape a website. The problem is that some of the information, specifically the latitude and longitude, is only available through a Google Maps embed in an iframe.

I'm able to get all the other information I need except this. After searching around and working with import.io tech support, I found that I need a specific XPath and regex to pull this information, but the code I found on the site has me lost. Ideally I'd like to pull the latitude and longitude separately. This is the code I have to work with.

What are my options? Thank you.

<div class="padding-listItem--sm">
  <iframe width="100%" height="310" frameborder="0" allowfullscreen="" src="https://www.google.com/maps/embed/v1/place?q=33.3929503,-111.908652&amp;key=AIzaSyDK08tC4NRubbIiw-xwDR1WEp-YAXX1Mx8" style="border:0"></iframe>
</div>

1) Get the src attribute of the iframe element.

String srcText = driver.findElement(By.tagName("iframe")).getAttribute("src");

2) Parse the url (found in srcText ) for the latitude and longitude values.

Regex to find both numbers:

/([-]?\d+\.\d+)/g

when the url is as you specified:

https://www.google.com/maps/embed/v1/place?q=33.3929503,-111.908652&amp;key=AIzaSyDK08tC4NRubbIiw-xwDR1WEp-YAXX1Mx8
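
As a rough illustration of step 2, here is a minimal Java sketch that applies the same pattern with java.util.regex. The class name CoordinateFinder and the method findDecimals are made up for this example. It assumes the first match is the latitude and the second is the longitude, which holds for this URL but is not guaranteed for every embed URL. The URL in main uses a plain & because the browser decodes &amp; before handing back the attribute value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Selenium or import.io.
class CoordinateFinder {
    // Same pattern as above, without the JavaScript-style /.../g delimiters.
    private static final Pattern DECIMAL = Pattern.compile("-?\\d+\\.\\d+");

    // Returns every signed decimal number found in the text, in order of appearance.
    static List<Double> findDecimals(String text) {
        List<Double> numbers = new ArrayList<>();
        Matcher matcher = DECIMAL.matcher(text);
        while (matcher.find()) {
            numbers.add(Double.parseDouble(matcher.group()));
        }
        return numbers;
    }

    public static void main(String[] args) {
        String srcText = "https://www.google.com/maps/embed/v1/place"
                + "?q=33.3929503,-111.908652&key=AIzaSyDK08tC4NRubbIiw-xwDR1WEp-YAXX1Mx8";
        List<Double> numbers = findDecimals(srcText);
        // Assumption: the first two decimals in the URL are latitude, then longitude.
        double latitude = numbers.get(0);
        double longitude = numbers.get(1);
        System.out.println("latitude=" + latitude + ", longitude=" + longitude);
    }
}
```

Because the pattern matches any decimal number anywhere in the string, it would also pick up decimals in other query parameters if the URL had any.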

The XPath to obtain the iframe source is:

//div[@class='padding-listItem--sm']/iframe/@src
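
If you want to try that XPath outside import.io, here is a minimal sketch using the JDK's built-in javax.xml.xpath. The class name IframeSrcXPath and the method extractSrc are invented for this example. It assumes the markup is well-formed XML, which happens to be true for the snippet in the question; a full real-world HTML page usually is not, and would need an HTML parser first.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class for trying the XPath against a markup fragment.
class IframeSrcXPath {
    static final String SRC_XPATH = "//div[@class='padding-listItem--sm']/iframe/@src";

    // Returns the iframe's src value, or an empty string when nothing matches.
    // Assumption: the markup parses as well-formed XML.
    static String extractSrc(String markup) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(SRC_XPATH, new InputSource(new StringReader(markup)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on markup", e);
        }
    }

    public static void main(String[] args) {
        String markup = "<div class=\"padding-listItem--sm\">"
                + "<iframe width=\"100%\" height=\"310\" frameborder=\"0\" allowfullscreen=\"\" "
                + "src=\"https://www.google.com/maps/embed/v1/place?q=33.3929503,-111.908652"
                + "&amp;key=AIzaSyDK08tC4NRubbIiw-xwDR1WEp-YAXX1Mx8\" style=\"border:0\"></iframe>"
                + "</div>";
        // The XML parser decodes &amp; to & in the returned attribute value.
        System.out.println(extractSrc(markup));
    }
}
```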

Then you can apply a regex like this one, which captures the latitude and longitude as two separate groups:

 /q=(-?[\d\.]*),(-?[\d\.]*)/g
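
To show how the two capture groups give you the values separately, here is a minimal Java sketch. The class name LatLngParser and the method parse are hypothetical. Since [\d.]* can also match an empty string or something like "1.2.3", the sketch treats anything that fails to parse as a number the same as no match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class illustrating the capture-group regex above.
class LatLngParser {
    // Group 1 is the latitude, group 2 is the longitude.
    private static final Pattern Q_PARAM = Pattern.compile("q=(-?[\\d.]*),(-?[\\d.]*)");

    // Returns {latitude, longitude}, or null when the URL has no usable q=lat,lng pair.
    static double[] parse(String url) {
        Matcher matcher = Q_PARAM.matcher(url);
        if (!matcher.find()) {
            return null;
        }
        try {
            double latitude = Double.parseDouble(matcher.group(1));
            double longitude = Double.parseDouble(matcher.group(2));
            return new double[] { latitude, longitude };
        } catch (NumberFormatException e) {
            // The pattern is loose, so a group may be empty or malformed.
            return null;
        }
    }

    public static void main(String[] args) {
        String src = "https://www.google.com/maps/embed/v1/place"
                + "?q=33.3929503,-111.908652&key=AIzaSyDK08tC4NRubbIiw-xwDR1WEp-YAXX1Mx8";
        double[] coords = parse(src);
        if (coords != null) {
            System.out.println("latitude=" + coords[0]);
            System.out.println("longitude=" + coords[1]);
        }
    }
}
```

Anchoring on q= makes this less likely to pick up unrelated numbers than a pattern that matches every decimal in the URL.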



 