
Scraping Inspect Element and Dynamic Webpage using Python

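I am trying to get the news content from https://www.thehindu.com/life-and-style/travel/the-embers-of-war/article29202579.ece

I looked for a pattern that would let me extract only the news content. Using Inspect Element, I found one: all of the news content sits inside a div tag that always has the same class name, "_yeti_done". My goal is to scrape only that news content.

For example,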

<div id="content-body-14269002-29202579" style="display: block;" class="_yeti_done"> Tom Cruise film is releasing tomorrow... </div>

But when I fetch the HTML with the requests library, the div comes back with only its id and not the class name. Like this,

<div id="content-body-14269002-29202579"> Tom Cruise film is releasing tomorrow... </div>

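After searching for an answer, I found that JavaScript adds parts of the HTML dynamically, so they are not included in the HTML returned when we run this code -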

requests.get('https://www.example.com')
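As a minimal offline sketch of that difference, the two div snippets shown above can be parsed with the standard library and their attributes compared. The class DivAttrCollector and the helper div_attributes are illustrative names, not part of requests or Selenium, and the div body is shortened to "text" here.

```python
from html.parser import HTMLParser


class DivAttrCollector(HTMLParser):
    """Collect the attributes of every <div> start tag as dicts."""

    def __init__(self):
        super().__init__()
        self.divs = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.divs.append(dict(attrs))


def div_attributes(html):
    """Return a list with one attribute dict per <div> found in html."""
    parser = DivAttrCollector()
    parser.feed(html)
    parser.close()
    return parser.divs


# Markup as seen in Inspect Element (after JavaScript has run).
rendered = ('<div id="content-body-14269002-29202579" '
            'style="display: block;" class="_yeti_done">text</div>')
# Markup as returned by a plain HTTP request (no JavaScript executed).
static = '<div id="content-body-14269002-29202579">text</div>'

rendered_div = div_attributes(rendered)[0]
static_div = div_attributes(static)[0]

print("class" in rendered_div)  # True
print("class" in static_div)    # False
print(static_div["id"])         # content-body-14269002-29202579
```

The id is present in both versions, while the class exists only in the rendered one, so a class-based lookup can only succeed against the rendered page.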

So I looked at Selenium. Here is my code -

from selenium import webdriver
import time

driver = webdriver.PhantomJS(executable_path = r'C:\Users\softloft\AppData\Local\Programs\Python\Python37\Scripts\phantomjs-2.1.1-windows\bin\phantomjs')
print(driver)
driver.get("https://www.example.com")
p_element = driver.find_element_by_class_name('_yeti_done')
print(p_element.text)

And the output -

<selenium.webdriver.phantomjs.webdriver.WebDriver (session="8ea980d0-c403-11e9-83fd-89667b66501a")>
NoSuchElementException                    Traceback (most recent call last)
<ipython-input-44-bda7935df3c4> in <module>
  6 print(driver)
  7 driver.get("https://www.thehindu.com/business/Industry/hyundai-drives-in-grand-i10-nios-at-499-lakh/article29178286.ece")
----> 8 p_element = driver.find_element_by_class_name('_yeti_done')
  9 print(p_element.text)
~\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py in find_element_by_class_name(self, name)
562             element = driver.find_element_by_class_name('foo')
563         """
--> 564         return self.find_element(by=By.CLASS_NAME, value=name)
565 
566     def find_elements_by_class_name(self, name):
~\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py in find_element(self, by, value)
976         return self.execute(Command.FIND_ELEMENT, {
977             'using': by,
--> 978             'value': value})['value']
979 
980     def find_elements(self, by=By.ID, value=None):
~\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py in execute(self, driver_command, params)
319         response = self.command_executor.execute(driver_command, params)
320         if response:
--> 321             self.error_handler.check_response(response)
322             response['value'] = self._unwrap_value(
323                 response.get('value', None))
~\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\errorhandler.py in check_response(self, response)
240                 alert_text = value['alert'].get('text')
241             raise exception_class(message, screen, stacktrace, alert_text)
--> 242         raise exception_class(message, screen, stacktrace)
243 
244     def _value_or_default(self, obj, key, default):
NoSuchElementException: Message: {"errorMessage":"Unable to find element with class name '_yeti_done'","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Content-Length":"99","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:55049","User-Agent":"selenium/3.141.0 (python windows)"},"httpVersion":"1.1","method":"POST","post":"{\"using\": \"class name\", \"value\": \"_yeti_done\", \"sessionId\": \"8ea980d0-c403-11e9-83fd-89667b66501a\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/8ea980d0-c403-11e9-83fd-89667b66501a/element"}}
Screenshot: available via screen

How can I solve this problem? Is there a way to get the div tag with that class name without using Selenium? Thanks.

Use the id instead of the class.
The _yeti_done class can change.

import re

import requests
from bs4 import BeautifulSoup

req = requests.get('https://www.thehindu.com/life-and-style/travel/the-embers-of-war/article29202579.ece')
soup = BeautifulSoup(req.text, 'html.parser')

# Find the article body by its id pattern rather than by the _yeti_done class.
content = soup.find('div', attrs={'id': re.compile(r"content-body-\d+-\d+")})
# p_tag.string is None when a <p> has more than one child node, so such paragraphs are skipped.
paragraphs = [p_tag.string for p_tag in content.find_all('p') if p_tag.string]

print('\n'.join(paragraphs))

Output

When Wing Commander Abhinandan Varthaman’s MiG 21 was shot down by Pakistan in February, he ejected and was soon captured. He was released as a goodwill gesture nearly three days later and returned to a hero’s welcome across the border. 
Over 51 years ago, Lt Commander John McCain wasn’t so lucky. The future Senator and Republican nominee for US President was in the middle of a bombing mission over North Vietnam in 1967 when his fighter plane was gunned down. He was rescued at Truc Bach Lake and sent to Hoa Lo Prison in Hanoi, which American Prisoners of War (PoWs) sarcastically referred to as the Hanoi Hilton. McCain walked free, but after nearly six years.
Hoa Lo Prison and the War Remnants Museum in Ho Chi Minh City are reminders of Vietnam’s grim past. 
A section of the former is now a museum, detailing primarily Vietnam’s independence struggle against the French, and the period when American PoWs were incarcerated between 1964 and 1973, the year the US’ involvement in the Vietnam War ended.
The brutal French prison system led Vietnamese prisoners to nickname Hoa Lo Prison as “hell on earth”. 
Some prisoners did manage to escape, through two underground sewers, displayed outside. A caption claims that in March 1945, over 100 escaped. What remains of Hoa Lo Prison (after part of it was demolished in the 1990s) details the horrors of oppression and torture inflicted by the imperialists, and a floor is dedicated to the heroics of revolutionary fighters. However, the American section is confined, surprisingly, to just a couple of rooms. The exhibits include the uniforms worn by the captured pilots, their prison clothes, utensils. However, you could be tricked into thinking that Hilton-level luxuries were available to the PoWs, as you see pictures of them playing outdoor sport, chess, being treated to fancy meals, reading letters from home and singing Christmas carols. The prison claims that the PoWs were “treated humanely”, but it contrasts with multiple accounts, on camera and in memoirs, by inmates like McCain and Everett Alvarez Jr (one of the longest-serving PoWs at Hoa Lo) to name a few, that they were inflicted with grotesque acts of torture, comparable with the French.
Fascinating tales by survivors of how they communicated by tapping on walls in secret get no mention here. In these places, it’s hard to expect balanced accounts of war and struggle, so the propaganda at Hoa Lo is hard to miss.
However, no matter whose side of the fence you are on, the destruction and misery of war is real and inescapable. An estimated 58,000 Americans died, the Vietnamese casualties on both sides were exponentially higher. Pictures show parts of Hanoi reduced to a rubble following America’s B-52 carpet bombings in 1972.
The War Remnants Museum takes you head on into the horrors of the Vietnam War in graphic detail. This multi-storey building at the centre of Ho Chi Minh City was inaugurated in 1975 as the ‘Exhibition House for US and Puppet Crimes’. The vast compound displays American tanks, helicopters, fighter planes, Howitzers etc, a treat for defence experts and enthusiasts.
This museum too lacks balanced reporting of the war, but is a very sobering experience for its exhibition of horrific pictures, many of which were recovered from cameras of photographers (133 of them) who perished on duty. The exhibition Requiem, curated by photographers Tim Page and Horst Faas, is a tribute to these photographers, whose images from the battlefield were a shock to the system, to many Americans in particular, who were kept in the dark about the events in Vietnam.
There are many gut-wrenching images — villagers pleading with the US Marines for mercy; a white soldier holding what remained of a Vietnamese soldier ripped apart by a grenade; young children hiding from their captors in a sewer; mass graves. The gallery Agent Orange shows haunting images of the after-effects of dioxin and napalm, used by the Americans to destroy crops and foliage. Even today, successive generations of Vietnamese and American soldiers who came into contact with the deadly dioxin are born with deformities and diseases of the worst kind.
This museum portrays the Americans and their allies as the aggressors. A sign shows the findings of the Bertrand Russell Tribunal (1967), which held the US government “guilty of genocide”. The open air exhibition area recreates parts of the prisoner of war camp at Phu Quoc island, run by the then Saigon government, that detained Viet Cong forces. Captions detail the torture techniques used, and accounts from survivors. The museum, interestingly, doesn’t trumpet the achievements by the North Vietnamese in suppressing the Americans. Images of victory can be seen in the section Historical Truths, which recaps events like the freedom struggle, the 1954 Dien Bien Phu battle that stunned the French into submission, the Fall of Saigon.
The theme on the ground floor is the fight for peace, with images of anti-war protests around the globe, including Calcutta. Pictures of some American PoWs — described, sarcastically or not, as the “special guests” — are reproduced here as well, for the benefit of visitors who couldn’t make it to Hoa Lo.
Pictures show McCain, Alvarez and others returning to Hoa Lo decades after their release, as goodwill visits, revisiting the past, yet not reopening old wounds. Facts may be debated or put to rest, but by the end of the tour, you can’t help but admire the resilience of the Vietnamese.
Political bridges may have been built, but the scars of war will remain, and that’s what the battle-weary country seeks to do through its war museums — remind, forgive, but not forget.
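The id pattern used in the code above can be checked without any network access. The following is only a sketch: the helper name matches_article_id is hypothetical, and it relies on Beautiful Soup applying a compiled pattern to an attribute value with search(), which is how its documentation describes regular-expression filters.

```python
import re

# Same pattern as in the answer: "content-body-", digits, "-", digits.
ID_PATTERN = re.compile(r"content-body-\d+-\d+")


def matches_article_id(div_id):
    """Return True if div_id looks like an article body id."""
    # A div without an id attribute yields None, which never matches.
    if div_id is None:
        return False
    return ID_PATTERN.search(div_id) is not None


print(matches_article_id("content-body-14269002-29202579"))  # True
print(matches_article_id("content-body"))                    # False
print(matches_article_id(None))                              # False
```

Because search() is used, the pattern may match anywhere inside the id; anchoring it with ^ and $ would be a stricter variant if other ids on the page happened to contain the same text.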

Do you mean something like this?

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.thehindu.com/life-and-style/travel/the-embers-of-war/article29202579.ece')
soup = bs(r.content, 'lxml')
print([i for i in soup.select_one('[id^=content-body]').get_text().split('\n') if i not in ['','\xa0']])
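The filtering step in the line above (dropping empty strings and lone non-breaking spaces) can be tried on its own. This is a slight variation rather than the exact code from the answer: the hypothetical helper clean_lines also strips surrounding whitespace from each line, and the sample text is made up for illustration.

```python
def clean_lines(text):
    """Split text into lines, trim each one, and drop the blank ones."""
    # str.strip() removes Unicode whitespace, including '\xa0'.
    stripped = (line.strip() for line in text.splitlines())
    return [line for line in stripped if line]


# Made-up sample resembling get_text() output with blank and nbsp-only lines.
sample = "First paragraph.\n\n\xa0\n  Second paragraph. \n"

print(clean_lines(sample))  # ['First paragraph.', 'Second paragraph.']
```

Stripping before filtering also discards lines that contain only spaces or tabs, which the membership test against ['', '\xa0'] would keep.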

