
Can't scrape the titles of different items from a webpage

I've written a script in Python to get the titles of the different "dupont" results from a webpage. The content is static, since it is available in the page source. However, I can't grab the titles using the following approach. How can I get them?

My script so far:

import requests
from bs4 import BeautifulSoup

url = 'https://www.pagesjaunes.fr/recherche/paris-75/dupont'

res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")
for item in soup.select("a.denomination-links"):
    print(item.text)

The output I'm expecting looks like:

Dupont Versailles
Dupont Guillaume
Brigitte Dupont-Clair

and so on.

For reference, the code below sends a full set of browser headers with the request:

import requests
from bs4 import BeautifulSoup

url = 'https://www.pagesjaunes.fr/recherche/paris-75/dupont'

headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",  # decoding "br" needs the brotli (or brotlicffi) package installed
    "accept-language": "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7,de;q=0.6",
    "cache-control": "no-cache",
    "cookie": "pjtmctxv1=8c557f78-092c-4060-c023-901b138ab1f5-b9488f20-7d96#4c8a23d4-e6ea-4229-9f53-10ff868cbec2#W600E#N#855eda61-8908-4d35-94ff-d5eb960d6feb#W6E7E#20190121#158ac3fcafc8826c4c8fbe0af06fa7e3; lieuAccueil=L07505600%7C2; VisitorID=034154805024253581; OAX=f614c8828d50f3b1; atidvisitor=%7B%22name%22%3A%22atidvisitor%22%2C%22val%22%3A%7B%22vrn%22%3A%22-483323-%22%2C%22at%22%3A%22%22%7D%2C%22options%22%3A%7B%22path%22%3A%22%2F%22%2C%22session%22%3A15724800%2C%22end%22%3A15724800%7D%7D; gig_hasGmid=ver2; pj_policy_cookie=1; datadome=.Fzep1tCSZBRXqwe.._w-7IF0s-2Hn.moLNN1WM73NP",
    "dnt": "1",
    "pragma": "no-cache",
    "referer": "https://c.datadome.co/captcha/?initialCid=AHrlqAAAAAMAjgK4-OSkf1AAy1rpIg%3D%3D&hash=A65A6A87BC0859C4FF51E5DA22F5E2&cid=.Fzep1tCSZBRXqwe.._w-7IF0s-2Hn.moLNN1WM73NP",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}

res = requests.get(url, headers=headers)
print(res.text)
soup = BeautifulSoup(res.text, "lxml")
for item in soup.select("a.denomination-links.pj-link"):
    print(item.get("title"))

I found that the following works as well. I've added only the two main headers responsible for fetching a valid result.

import requests
from bs4 import BeautifulSoup

url = 'https://www.pagesjaunes.fr/recherche/paris-75/dupont'

res = requests.get(url, headers={
    "accept-language": "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7,de;q=0.6",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
})
soup = BeautifulSoup(res.text, "lxml")
for item in soup.select("a.denomination-links"):
    print(item.get("title")) 
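
The referer in the longer header set points at a captcha URL, which suggests the site may answer requests it considers automated with a challenge page instead of the results. In that case the selector matches nothing and the loop prints nothing. To tell a blocked response from a wrong selector, it can help to test the extraction logic offline on a saved copy of the page. The sketch below does that with the standard library only. `TitleCollector`, `extract_titles`, and `SAMPLE_HTML` are hypothetical names, and the sample markup is an assumption about the page structure, not a copy of the real page.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect titles from <a> tags whose class list includes 'denomination-links'."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_link = False  # True while inside a matching <a> that has no title attribute
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "denomination-links" not in classes:
            return
        title = (attr.get("title") or "").strip()
        if title:
            # Prefer the title attribute, as the answers above do.
            self.titles.append(title)
        else:
            # Otherwise fall back to the link text, as the question does.
            self._in_link = True
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # Collapse runs of whitespace in the collected link text.
            text = " ".join("".join(self._text).split())
            if text:
                self.titles.append(text)
            self._in_link = False


def extract_titles(html):
    """Return the list of titles found in an HTML string."""
    parser = TitleCollector()
    parser.feed(html)
    parser.close()
    return parser.titles


# Hypothetical markup, assumed to resemble the result links on the page.
SAMPLE_HTML = """
<a class="denomination-links pj-link" title="Dupont Versailles" href="#">x</a>
<a class="denomination-links" href="#">  Dupont
   Guillaume </a>
<a class="other" title="ignored" href="#">nope</a>
"""

print(extract_titles(SAMPLE_HTML))
```

If `extract_titles(res.text)` returns an empty list for a live response while it works on a page saved from the browser, the request was probably served different HTML, and the headers are the place to look.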
