
I can't get all the html data from beautiful soup

I'm new to web scraping, and I wanted to get just a piece of text from a Google page (basically the date of a soccer match). But the soup doesn't contain all of the HTML (I'm guessing because of the request), so I can't find the element. I know this may be because Google uses JavaScript, and that I should use Selenium with chromedriver, but the thing is that I need the code to be usable on another computer, so I can't really use that.

Here's the code:

import pandas as pd
from bs4 import BeautifulSoup
import requests

a = "Newcastle"
url = "https://www.google.com/search?q=" + a + "+next+match"

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

print(soup)

for div in soup.find_all('div'):
    print(div.get_text())

What I want to find is

"<span class="imso_mh__lr-dt-ds">17/12, 13:30</span>"

It has

"//*[@id="sports-app"]/div/div[3]/div[1]/div/div/div/div/div[1]/div/div[1]/div/span[2]"

as its XPath.

Is it even possible?

Try setting the User-Agent header when requesting the page from Google:

import requests
from bs4 import BeautifulSoup


a = "Newcastle"
# hl=en forces English-language results so the markup is predictable
url = "https://www.google.com/search?q=" + a + "+next+match&hl=en"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0"
}

soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

# The match-header block that holds the date/time of the next game
next_match = soup.select_one('[data-entityname="Match Header"]')

# Remove hidden screen-reader duplicates before extracting the text
for t in next_match.select('[aria-hidden="true"]'):
    t.extract()

text = next_match.get_text(strip=True, separator=" ")
print(text)

Prints:

Club Friendlies · Dec 17, 13:30 Newcastle VS Vallecano
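If only the date string is needed, the same approach can target the span shown in the question directly. Here is a minimal sketch against a saved snippet of the markup; note that Google's generated class names such as `imso_mh__lr-dt-ds` are not stable and may change at any time, so always guard against the selector returning nothing:

```python
from bs4 import BeautifulSoup

# A trimmed snippet of the match-header markup from the question;
# in practice this would be the HTML fetched with the User-Agent header set.
html = '<div><span class="imso_mh__lr-dt-ds">17/12, 13:30</span></div>'

soup = BeautifulSoup(html, "html.parser")
date_span = soup.select_one("span.imso_mh__lr-dt-ds")
if date_span is not None:
    print(date_span.get_text(strip=True))  # prints 17/12, 13:30
```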
