
python lxml.html: pull preceding text in html docstring

I'm trying to identify a given <table> element based on the text that precedes it in the html document.

My current method is to stringify each html table element and search for its text index within the file text:

import lxml.html
from urllib import request

# url is assumed to be defined earlier
filing_text = request.urlopen(url).read()
filing_tree = lxml.html.fromstring(filing_text)

# some text cleanup here to make lxml's output match the .read() content
ref_text = filing_text.upper().replace(b"&#160;", b"&NBSP;")
tbl_count = 0
for tbl in filing_tree.iterfind('.//table'):
    # serialize the table once, then look for it in the reference text
    tbl_html = lxml.html.tostring(tbl)
    text_ind = ref_text.find(tbl_html.upper().replace(b"&#160;", b"&NBSP;"))
    start_text = tbl_html[0:50]
    tbl_count += 1
    print('tbl: %s; position: %s; %s' % (tbl_count, text_ind, start_text))

Given the starting index of the table element, I can then search the x characters preceding it for text that may help identify the table's content.

Two concerns with this approach:

  1. Since the tag density (i.e., how much of the filing text is markup versus content) varies from URL to URL, it's hard to standardize my search range in the preceding text. 2500 characters of HTML may encompass 300 characters of actual content, or 2000.
  2. Serializing and searching once per table element seems rather inefficient. It adds more overhead to a web-scraping workflow than I'd like.
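
To make the first concern concrete, here is a minimal sketch of the look-back step and of how much of a fixed-width window is actual content. The helper names `preceding_window` and `content_ratio`, the regex-based tag stripping, and the two sample byte strings are all assumptions made for illustration; none of them come from the code above.

```python
import re

# Crude tag matcher, good enough to estimate tag density (an assumption,
# not a real HTML parser).
TAG_RE = re.compile(rb"<[^>]*>")


def preceding_window(ref_text: bytes, table_start: int, width: int = 2500) -> bytes:
    """Return up to `width` bytes of raw HTML that precede `table_start`."""
    return ref_text[max(0, table_start - width):table_start]


def content_ratio(window: bytes) -> float:
    """Fraction of the window that remains after stripping tags."""
    if not window:
        return 0.0
    return len(TAG_RE.sub(b"", window)) / len(window)


# Two hypothetical documents: one markup-heavy, one content-heavy.
markup_heavy = b'<div class="a"><span style="x">Hi</span></div><table></table>'
content_heavy = b'<p>Balance sheet as of December 31</p><table></table>'

for html in (markup_heavy, content_heavy):
    start = html.find(b"<table")
    window = preceding_window(html, start)
    print(len(window), round(content_ratio(window), 2))
```

The same window width covers very different amounts of readable text in the two samples, which is why a fixed character offset is hard to standardize.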

Question: Is there a better way to do this? Is there an lxml method that can extract text content prior to a given element? I'm imagining something like itertext() that goes backwards from the element, recursively through the html docstring.

Use Beautiful Soup. Just a snippet to get you started:

>>> from bs4 import BeautifulSoup
>>> stupid_html = "<html><p> Hello </p><table> </table></html>"
>>> soup = BeautifulSoup(stupid_html, "html.parser")
>>> list_of_tables = soup.find_all("table")
>>> print( list_of_tables[0].previous_element )
 Hello 
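
If you would rather stay in lxml, something similar can probably be done with the XPath `preceding` axis, which selects nodes that end before the given element starts. The following is only a sketch: the helper name `text_before`, the `max_chars` parameter, and the sample HTML are made up for illustration, and the whitespace handling is one possible choice rather than a recommendation.

```python
import lxml.html


def text_before(element, max_chars=300):
    """Collect up to `max_chars` of visible text that precedes `element`."""
    # preceding::text() returns text nodes located before the element,
    # in document order; text inside <script>/<style> is filtered out.
    nodes = element.xpath(
        "preceding::text()[not(ancestor::script) and not(ancestor::style)]"
    )
    collected = []
    total = 0
    # Walk backwards from the element so we can stop early.
    for node in reversed(nodes):
        chunk = " ".join(node.split())  # collapse runs of whitespace
        if not chunk:
            continue
        collected.append(chunk)
        total += len(chunk) + 1
        if total >= max_chars:
            break
    text = " ".join(reversed(collected))
    return text[-max_chars:]


# Hypothetical sample document.
doc = lxml.html.fromstring(
    "<html><body><p>Revenue by segment</p>"
    "<div><span>(in millions)</span></div>"
    "<table><tr><td>1</td></tr></table><p>After</p></body></html>"
)
for tbl in doc.iterfind(".//table"):
    print(text_before(tbl))
```

Because the limit counts characters of text rather than characters of markup, the search range no longer depends on tag density, and nothing has to be serialized per table.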
