Extract text from webpage using Selenium in Python

How can I use Python Selenium to extract " : Sahih al-Bukhari 248"?

The following does not seem to work:

reference = find_element_by_xpath(".//div[3]/table/tbody/tr[1]/td[2]").text
print reference

See the HTML code below:

 <div class="actualHadithContainer">
    <!-- Begin hadith -->
    <a name="1"></a>
    <div class="englishcontainer">
    <div class="english_hadith_full" style="display: block;">
    <div class="hadith_narrated"><p>Narrated `Aisha:</p></div>
    <div class="text_details">
    <p>Whenever the Prophet (ﷺ) took a bath after Janaba he started by washing his hands and then performed ablution like that for the prayer. After that he would put his fingers in water and move the roots of his hair with them, and then pour three handfuls of water over his head and then pour water all over his body.</p></div>
    <div class="clear"></div></div></div>
    <div class="arabic_hadith_full arabic"><span class="arabic_sanad arabic"></span>
    <span class="arabic_text_details arabic">حَدَّثَنَا عَبْدُ اللَّهِ بْنُ يُوسُفَ، قَالَ أَخْبَرَنَا مَالِ</span><span class="arabic_sanad arabic"></span></div>
    <!-- End hadith -->
    <div class="bottomItems">
    <table class="hadith_reference" cellspacing="0" cellpadding="0">
    <tbody><tr><td><b>Reference</b></td>
    <td>&nbsp;:&nbsp;Sahih al-Bukhari 248</td></tr>
    <tr><td>In-book reference</td>
    <td>&nbsp;:&nbsp;Book 5, Hadith 1</td></tr>
    <tr><td>USC-MSA web (English) reference</td><td>&nbsp;: Vol. 1, Book 5, Hadith 248</td></tr> 
    <tr><td>&nbsp;&nbsp;<i>(deprecated numbering scheme)</i></td></tr></tbody></table><div class="hadith_permalink"><a href="javascript: void(0);" onclick="reportHadith(2490, 'h102490')">Report Error</a> | <span class="sharelink" onclick="share('/bukhari/5/1')">Share</span></div></div>
    <div class="clear"></div></div>

I am using the code below to extract other items, but I am having difficulty with the excerpt above.

Code:

from selenium import webdriver
import os
import re
driver = webdriver.PhantomJS()
driver.implicitly_wait(30)
driver.set_window_size(1120, 550)
driver.get("https://www.sunnah.com/bukhari/5");
print driver.title
print driver.find_element_by_css_selector('.book_page_english_name').text
print driver.find_element_by_xpath('//*[@id="main"]/div[2]/div[1]/div[3]').text

for person in driver.find_elements_by_class_name('actualHadithContainer'):
    try:
        title1 = person.find_element_by_xpath('.//div[@class="hadith_narrated"]/p').text
        if title1:
            print title1
        else:
            print "exception"
            title1 = person.find_element_by_xpath('.//div[@class="hadith_narrated"]').text
            print title1
        title2 = person.find_element_by_xpath('.//div[@class="text_details"]/p').text
        if title2:
            print title2
        else:
            title2 = person.find_element_by_xpath('.//div[@class="text_details"]').text
            print title2

        reference = find_element_by_xpath(".//div[3]/table/tbody/tr[1]/td[2]").text
        print reference

    except:
        print "exception"

The Selenium API is best used for tasks that need a real browser, such as clicking a button or scrolling to the bottom of the page.

When you need to extract information from HTML, you should use BeautifulSoup, which is much simpler:

from selenium import webdriver
import os
import re
from bs4 import BeautifulSoup
driver = webdriver.PhantomJS()
driver.implicitly_wait(30)
driver.set_window_size(1120, 550)
driver.get("https://www.sunnah.com/bukhari/5")
soup = BeautifulSoup(driver.page_source, 'lxml')
# Print the text of the first row of the first reference table
print(soup.find(name='table', class_='hadith_reference').tr.text)

And since this page is static, you can use requests instead:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.sunnah.com/bukhari/5")
soup = BeautifulSoup(r.text, 'lxml')
for div in soup.find_all(class_='actualHadithContainer'):
    ref = div.find(name='table', class_='hadith_reference').tr.text
    print(ref)

Output:

Reference : Sahih al-Bukhari 248
Reference : Sahih al-Bukhari 249
Reference : Sahih al-Bukhari 250
Reference : Sahih al-Bukhari 251
Reference : Sahih al-Bukhari 252
Reference : Sahih al-Bukhari 253
Reference : Sahih al-Bukhari 254
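
The row text above includes the "Reference" label, while the question asks only for the value in the second cell. A minimal sketch of one way to separate the two is shown below. It assumes the row text contains non-breaking spaces (from `&nbsp;` in the HTML) around a single separating colon; `split_reference` is a hypothetical helper name, not part of BeautifulSoup or Selenium.

```python
def split_reference(row_text):
    """Split row text such as 'Reference : Sahih al-Bukhari 248'
    into a (label, value) pair. Hypothetical helper for illustration."""
    # &nbsp; entities arrive as the character U+00A0; turn them into spaces
    normalized = row_text.replace("\xa0", " ")
    # Split on the first colon only, so later colons stay in the value
    label, sep, value = normalized.partition(":")
    if not sep:
        # No colon in the row (e.g. the "deprecated numbering scheme" row)
        return normalized.strip(), ""
    return label.strip(), value.strip()


label, value = split_reference("Reference\n\xa0:\xa0Sahih al-Bukhari 248")
print(value)
```

In the loop above, this could be called as `split_reference(ref)` on each row's text to keep only the value.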
