
Google Maps place ID using Selenium

from selenium import webdriver
import re

driver = webdriver.Chrome(executable_path=r"C:\Users\chromedriver")
sentence = "chiropractor in maryland"
url = "https://google.com/search?hl=en&q={}".format(sentence)
driver.get(url)
links = driver.find_elements_by_xpath('//a[@href]')
# Click the "Maps" link on the search results page
[i for i in links if i.text == "Maps"][0].click()
html = driver.page_source
# Place IDs that are present in html:
# ChIJaYGxdRj9t4kRcJmJlvQkKX0
# ChIJCf4MzWjgt4kRluBnhQTHlBM
# ChIJBXxr8brIt4kRVE-gIYDyV8c
# ChIJX0W_Xo4syIkRUAtRFy8nz1Y

Hello, this is my first Selenium project. I am trying to extract the place IDs from the search results. I have listed some of the place IDs above (I got them using the API). I could not find them in the browser's inspector tools; however, they are present in the page source. I tried using a regex, and the IDs seem to appear in a context like the following:

2,[null,null,\\"bizbuilder:gmb_web\\",[6,7,4,1,3]\\n]\\n]\\n]\\n,1,null,null,null,null,null,null,[\\"-8523065488279764631\\",\\"9018780361702349168\\"]\\n]\\n]\\n]\\n,null,null,null,[[\\"chiropractor\\"]\\n]\\n,null,\\"ChIJaYGxdRj9t4kRcJmJlvQkKX0\\",null,null,null,[\\"South Gate\\",\\"806 Landmark Dr Suite 126\\",\\"806 Landmark Dr Suite 126\\",\\"Glen Burnie\\"]\\n,null,null,null,null,null,[null,\\"SearchResult.TYPE_PERSONAL_

That is, the place ID appears after \\"chiropractor\\"]\\n]\\n,null,\\" and before \\",null ...

But I can't work out the regex for it. I need help writing the correct regex, or finding another way to get the place_id. Please don't answer by referring me to their API.

I think this could be improved, but the string itself sits in a script tag that contains window.APP_OPTIONS. Each of those IDs starts with ChIJ, is followed by characters from a defined set, and is 27 characters long in total.
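As a quick check of that pattern, here is a minimal sketch that runs it against an abbreviated copy of the fragment quoted in the question. The names `fragment` and `PLACE_ID` are illustrative only, and the character set is an assumption based on the IDs seen so far.

```python
import re

# Abbreviated fragment of the page source quoted in the question
fragment = (
    r'[[\\"chiropractor\\"]\\n]\\n,null,'
    r'\\"ChIJaYGxdRj9t4kRcJmJlvQkKX0\\",null,null,null,'
    r'[\\"South Gate\\",\\"806 Landmark Dr Suite 126\\"]'
)

# "ChIJ" plus 23 characters from the assumed character set = 27 in total
PLACE_ID = re.compile(r'ChIJ[A-Za-z0-9._-]{23}')

ids = PLACE_ID.findall(fragment)
print(ids)
```

The same pattern also covers the other IDs listed in the question, which contain `-` and `_`.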

I also started directly on the Maps page rather than clicking through to it. I didn't need a wait condition across several runs, though one could be added if required.
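If a wait does turn out to be necessary, one possible sketch uses Selenium's WebDriverWait with a custom predicate. The helper names `app_options_present` and `wait_for_app_options` and the 10-second timeout are assumptions for illustration; whether waiting on window.APP_OPTIONS is the right signal for your page is also an assumption.

```python
def app_options_present(driver):
    """Return True once the page source mentions window.APP_OPTIONS."""
    return 'window.APP_OPTIONS' in driver.page_source


def wait_for_app_options(driver, timeout=10):
    # Imported here so the predicate above can be reused without Selenium
    from selenium.webdriver.support.ui import WebDriverWait

    # until() keeps calling the predicate with the driver until it returns
    # a truthy value, and raises TimeoutException after `timeout` seconds
    return WebDriverWait(driver, timeout).until(app_options_present)
```

You would call `wait_for_app_options(d)` right after `d.get(url)` and before reading `d.page_source`.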

from selenium import webdriver
from bs4 import BeautifulSoup as bs
import re

sentence = "chiropractor in maryland"
url = 'https://www.google.com/maps/search/{}'.format(sentence)
d = webdriver.Chrome()
d.get(url)
soup = bs(d.page_source, 'lxml')

# Find the script tag containing window.APP_OPTIONS; stay empty if none matches
script_text = ''
for script in soup.select('script'):
    if 'window.APP_OPTIONS' in script.text:
        script_text = script.text
        break

# "ChIJ" followed by 23 characters from the expected set (27 in total)
r = re.compile(r'(ChIJ[a-zA-Z\.0-9\-\_]{23})')
items = r.findall(script_text)
print(items)

d.quit()
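The URL above is built by dropping the raw search phrase into the path. That worked in my runs, but if the phrase may contain characters other than letters and spaces, a hedged alternative is to percent-encode it first. The helper name `maps_search_url` is hypothetical.

```python
from urllib.parse import quote


def maps_search_url(query):
    # safe='' also encodes "/" so the query cannot add extra path segments
    return 'https://www.google.com/maps/search/{}'.format(quote(query, safe=''))


print(maps_search_url("chiropractor in maryland"))
```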

A little riskier: you could work off page_source directly.

from selenium import webdriver
import re

sentence = "chiropractor in maryland"
url = 'https://www.google.com/maps/search/{}'.format(sentence)
d = webdriver.Chrome()
d.get(url)

# Search the whole page source instead of a single script tag
r = re.compile(r'(ChIJ[a-zA-Z\.0-9\-\_]{23})')
items = r.findall(d.page_source)
print(items)

d.quit()

Notes:

The pattern is designed to match only the required items for the given search, as the page currently stands. It is conceivable that, in future or different searches, a string could match the pattern without being an ID. The page_source is a larger search space, so there is a greater likelihood of encountering an unwanted string that matches the pattern. The script tag is both where you would expect to find the IDs and a smaller search space. Over time you might also want to check that the character set does not need additional characters to match new IDs. You can easily check the number of matches against the number of results per page.
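One way to do that check, as a rough sketch: de-duplicate the matches (an ID may plausibly appear more than once in the source, which is an assumption here) and compare the count with the number of results you see on the page. The helper names and the example count are hypothetical.

```python
def unique_place_ids(matches):
    """Drop repeated matches while keeping first-seen order."""
    return list(dict.fromkeys(matches))


def count_matches_expected(ids, expected):
    """Return True when the number of unique IDs equals the expected count."""
    return len(ids) == expected


# Example input: IDs from the question, with one deliberately repeated
matches = [
    'ChIJaYGxdRj9t4kRcJmJlvQkKX0',
    'ChIJCf4MzWjgt4kRluBnhQTHlBM',
    'ChIJaYGxdRj9t4kRcJmJlvQkKX0',
]
ids = unique_place_ids(matches)
print(ids, count_matches_expected(ids, expected=2))
```

A mismatch suggests either a stray string matching the pattern or an ID that the character set no longer covers.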
