
How to get all pages' text from PDF page links using Selenium IDE and Python

I am trying to get the text from PDF pages. To do that, I hit the PDF page links one by one using XPath, Selenium IDE, and Python, but it gives me empty data. Sometimes it gives me the content of one page of the PDF, but not in any particular format.

How can I get the text from all pages of a PDF link?

Here is my code:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC 

url = "http://www.incredibleindia.org"
driver = webdriver.Firefox()
driver.get(url) 
# wait for the menu to be loaded
WebDriverWait(driver,10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.menu li > a")))

# article link under the media tab
article_link = [a.get_attribute('href') for a in driver.find_elements_by_xpath("html/body/div[3]/div/div[1]/div[2]/ul/li[3]/ul/li[6]/a")]
# all article links under the media tab
for link in article_link:
    print link
    driver.get(link) 
    # check that the article sub-links CSS is available on the article page
    try:
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.article-full-div")))
    except TimeoutException:
        print driver.title, "No news links under media tab"
    # article sub-links under the article tab
    article_sub_links = [a.get_attribute('href') for a in driver.find_elements_by_xpath(".//*[@id='article-content']/div/div[2]/ul/li/a")]

    print "article sub links"
    for link in article_sub_links:
        print link

        driver.get(link)  
        try:
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.textLayer")))
        except TimeoutException:
            print driver.title, "No news links under media tab"

        content = [a.text for a in driver.find_elements_by_xpath(".//*[contains(@id,'pageContainer')]")] 
        print content 
        for data in content:
            print data

Output:

http://www.incredibleindia.org/en/media-black-2/articles
article sub links
http://www.incredibleindia.org/images/articles/Ajanta.pdf
[u'', u'', u'']



http://www.incredibleindia.org/images/articles/Bedhaghat.pdf
404 - Error: 404 No news links under media tab
[]
http://www.incredibleindia.org/images/articles/Bellur.pdf
[u'', u'', u'']



http://www.incredibleindia.org/images/articles/Bidar.pdf
[u'', u'', u'']



http://www.incredibleindia.org/images/articles/Braj.pdf
[u'', u'', u'', u'']




http://www.incredibleindia.org/images/articles/Carnival.pdf
[u'', u'', u'']

I think you need to go down to the text layer (the div element with class="textLayer" inside each page container). Selenium's .text only returns text that is visible on the page, so the wait below checks for visibility rather than mere presence. You also need to use continue in the exception-handling block, so that a link which times out is skipped instead of falling through to the extraction step:

for link in article_sub_links:
    driver.get(link)

    # wait until the PDF text layer is actually visible; .text is empty otherwise
    try:
        WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.textLayer")))
    except TimeoutException:
        print driver.title, "Empty content"
        continue

    # read the text layer inside each page container, not the container itself
    content = [a.text for a in driver.find_elements_by_css_selector("div[id^=pageContainer] div.textLayer")]
    for data in content:
        print data
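
If some pages still come back empty, one likely cause (an assumption, not something confirmed by the question) is that Firefox's built-in PDF viewer, pdf.js, renders pages lazily as they scroll into view, so off-screen text layers stay empty. A minimal sketch of a workaround, reusing the driver and the pageContainer/textLayer selectors from above: scroll each page container into the viewport and give its text layer a moment to render before reading it.

import time

# assumes `driver` is the WebDriver instance already pointed at the PDF page
pages = driver.find_elements_by_css_selector("div[id^=pageContainer]")
content = []
for page in pages:
    # bring the page into the viewport so pdf.js renders its text layer
    driver.execute_script("arguments[0].scrollIntoView();", page)
    time.sleep(1)  # crude fixed wait; a per-page WebDriverWait would be more robust
    content.append(page.find_element_by_css_selector("div.textLayer").text)

for data in content:
    print data

The fixed sleep is only for illustration; in practice, waiting until each page's text layer is non-empty would be more reliable.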
