Python web scraping of Expedia: how to find the right selector

I am learning to write scrapers in Python. As a simple exercise, I am trying to find cheap tickets on Expedia. My problem is finding the right selector or an accurate keyword. I use functions such as select() and find(). I have run many tests with them, but I still have not succeeded: I always get an empty list. Is there a better way to find the right selector or keyword? Below is part of my code. In it, I try to locate the input for the "Flying from" place and the "Roundtrip" button.

    import requests
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}
    url="https://www.expedia.com/Flights?langid=1033&semcid=US.MULTILOBF.GOOGLE.GT-c-EN.FLIGHT&semdtl=a1355852835.b125535175035.r1.g1kwd-12197061.i1.d1280328929841.e1c.j120181.k1.f11t1.n1.l1g.h1e.m1&gclid=CjwKCAiAws7uBRAkEiwAMlbZjjBMg2bBBbp59C6tXeHf-4sXVvc4ya7EflIKQGsaFgENRP_SbaNQrRoCsUoQAvD_BwE"  
    address_page1 = requests.get(url, headers=headers).content
    soup = BeautifulSoup(address_page1,'html.parser')

    find = soup.find_all(id='flight-origin-flp-airport_code')
    print(find)

    select1 = soup.select('#flight-origin-flp-airport_code')
    print(select1)

    select2 = soup.select('#gcw-flights-form-flp > div.cols-nested.ab25184-location > div > div > div.input-btn-group')
    print(select2)

Your approach is fundamentally flawed. Most of today's websites, including Expedia, rely heavily on JavaScript. The data you want may not even be present in the HTML when you fetch the page this way. You probably want a tool like Puppeteer, which drives a full browser. A plain Python HTTP library cannot execute on-page JavaScript the way your browser does. If you want to stick with Python, there may be a Puppeteer wrapper, but you would have a much easier time using Puppeteer and JavaScript directly.
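To illustrate the point, here is a minimal sketch of what a static parser sees when an element is created by JavaScript. The HTML is made up for the demonstration (it is not Expedia's real markup); only the id is borrowed from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the <input> only exists after the script runs in a browser.
STATIC_HTML = """
<html><body>
  <div id="search-form"></div>
  <script>
    document.getElementById("search-form").innerHTML =
      '<input id="flight-origin-flp-airport_code">';
  </script>
</body></html>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")

# The empty container is part of the raw HTML, so it is found.
container = soup.select("#search-form")

# The input appears only as text inside <script>. No JavaScript was executed,
# so there is no such element in the parsed tree.
origin_input = soup.select("#flight-origin-flp-airport_code")

print(len(container))
print(origin_input)
```

If the selector works in the browser's developer tools but returns an empty list here, the element is most likely being added by a script after the page loads.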

Finding the part of the page you need is easier with the browser's developer tools, for example Ctrl + Shift + C in Chrome.

You can search the page source manually in the browser, or in the prettified output of:

    print(BeautifulSoup(requests.get(url).text, 'html.parser').prettify())

Your Python request probably received a "We can't tell if you're a human or a bot." page instead of the normal Expedia page. You can try reusing the cookies from your browser in a requests session to get the proper page.
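A minimal sketch of that idea follows. The helper names, the cookie name in the usage note, and the marker string are assumptions for illustration; the real bot page may word its message differently, and copied cookies may expire or be rejected.

```python
import requests

# Assumed marker, based on the message quoted above; the real page may differ.
BOT_MARKER = "human or a bot"


def looks_like_bot_page(html: str) -> bool:
    """Return True if the response body looks like the bot-check page."""
    return BOT_MARKER in html.lower()


def make_session(browser_cookies: dict, user_agent: str) -> requests.Session:
    """Build a session that sends the given cookies and User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    session.cookies.update(browser_cookies)
    return session


def fetch(session: requests.Session, url: str) -> str:
    """Fetch a page and fail loudly if the bot-check page came back instead."""
    html = session.get(url, timeout=30).text
    if looks_like_bot_page(html):
        raise RuntimeError("Got the bot-check page, not the real content")
    return html
```

Usage would look like `fetch(make_session({"cookie_name": "value copied from the browser"}, headers["User-Agent"]), url)`, where the cookie names and values are read from the browser's developer tools.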

Once you have found the node you need visually, look at its tag name and attributes, and also at its parent's tag name and attributes.

Roundtrip example:

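    # Collect every <label>, then keep the first one whose id contains "roundtrip".
    result = soup.find_all('label')
    roundtrip = None
    for label in result:
        # Skip labels that have no id attribute.
        if 'id' not in label.attrs:
            continue
        if 'roundtrip' in label.attrs['id']:
            roundtrip = label
            break
    print(roundtrip)  # None if no matching label was found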
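The same lookup can also be written as a CSS attribute selector, which may be shorter than the loop. This is a sketch on made-up HTML; the label ids are hypothetical and only stand in for whatever the real page uses.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for demonstration only.
SAMPLE_HTML = """
<form>
  <label id="flight-type-roundtrip-label">Roundtrip</label>
  <label id="flight-type-one-way-label">One way</label>
  <label>No id here</label>
</form>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# [id*="roundtrip"] matches any id that contains the substring "roundtrip".
# select_one returns the first match, or None if nothing matches.
roundtrip = soup.select_one('label[id*="roundtrip"]')
print(roundtrip)
```

Labels without an id are skipped automatically, because the attribute selector only matches elements that have the attribute.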
