
Python: Text extraction and list comprehension

I have extracted text from a PDF file using pdfplumber. The text contains several items of the format 'Exhibit XY', where X is a letter (or letters) and Y is a number, e.g. Exhibit C40 or Exhibit R700.

I am trying to reduce the entire extracted text to a list of just the various Exhibit XY combinations. My initial thought was to convert the text string into a list:

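import pdfplumber

# `file` is the path to the PDF, defined elsewhere
with pdfplumber.open(file) as pdf:
    # Extract the text of the first three pages; fall back to an empty
    # string in case extract_text() returns nothing for a page
    page_texts = [page.extract_text() or "" for page in pdf.pages[:3]]

# print(*page_texts)

# Join with a newline so the last word of one page does not fuse
# with the first word of the next
full_text = "\n".join(page_texts)

# Split on whitespace to get a list of words
list_full_text = full_text.split()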

The output from pdfplumber is as follows:

apple cars 2014 pizza hut. Aftermath, you tried an Exhibit R40; decidedly 50 times 
larger than Exhibit C400. The 1,000 luckiest break had the under dome Exhibit R9. 
Exhibit P21 as well. 0.1 you have not found it again. Exhibit CB12 district office see 
Exhibit MM42. 

In list form, this is:

['apple', 'cars', '2014', 'pizza', 'hut.', 'Aftermath,', 'you', 'tried', 'an', 'Exhibit', 'R40;', 'decidedly', '50', 'times', 'larger', 'than', 'Exhibit', 'C400.', 'The', '1,000', 'luckiest', 'break', 'had', 'the', 'under', 'dome', 'Exhibit', 'R9.', 'Exhibit', 'P21', 'as', 'well.', '0.1', 'you', 'have', 'not', 'found', 'it', 'again.', 'Exhibit', 'CB12', 'district', 'office', 'see', 'Exhibit', 'MM42.']

My sense is that some form of list comprehension might be able to reduce the list to only the Exhibit XY combinations, e.g. with something like this:

print([i for i in list_full_text if <some condition>])

but I'm not sure what condition could capture all of 'Exhibit', 'X' and 'Y'.

Note: The body of text also contains various other numbers, such as years (e.g. 1992) or quantities (e.g. 50). I only need the numbers that are preceded by a letter.

Many thanks, Guy

Try it this way:

ap_lst = list_full_text  # the list of words shown in the question
for item in ap_lst:
    if 'Exhibit' in ap_lst[ap_lst.index(item)-1]:  # is the preceding word 'Exhibit'?
        print('Exhibit', item)

Output:

Exhibit R40;
Exhibit C400.
Exhibit R9.
Exhibit P21
Exhibit CB12
Exhibit MM42.

Obviously, you can clean up the output by removing periods, semi-colons, etc.
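
As a rough sketch of that clean-up step, assuming the stray characters are ordinary ASCII punctuation at the start or end of each item, str.strip together with string.punctuation removes them. The raw_items list is simply copied from the output above, and the variable names are only illustrative.

```python
import string

# Items as printed above, still carrying trailing punctuation
raw_items = ['R40;', 'C400.', 'R9.', 'P21', 'CB12', 'MM42.']

# strip() removes any leading/trailing characters found in string.punctuation
cleaned = ['Exhibit ' + item.strip(string.punctuation) for item in raw_items]

print(cleaned)
```

Because strip only touches the ends of each string, punctuation inside an item (if there were any) would be left alone.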

Edit: explanation of the third line (the if condition):

For each element in the list, find the index position of that element ( ap_lst.index(item) ). Now we need to check which word is in the immediately preceding list element. That preceding element has an index position lower by one ( ap_lst.index(item)-1 ) than that of the current element. Then, using this new index position, look up the element at that position in the list ( ap_lst[ap_lst.index(item)-1] ). If that preceding element contains the word Exhibit, you know that the current element is the target exhibit number.
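
One caveat with the lookup described above: list.index returns the position of the first occurrence of a value, so a word that appears more than once is always checked against the word before its first appearance, and for the very first element the index -1 wraps around to the last element of the list. A possible alternative, sketched here under the assumption that an exhibit ID is one or more letters followed by digits (as in the question's examples), pairs each word with its predecessor using zip, which also fits the list comprehension the question asked about. The function names exhibits_from_words and exhibits_from_text are made up for this illustration; the second one skips the word list and applies a regular expression to the raw text.

```python
import re
import string

# Letters followed by digits, e.g. R40 or CB12 (pattern assumed from the question's examples)
EXHIBIT_ID = re.compile(r"[A-Za-z]+\d+")


def exhibits_from_words(words):
    """Keep each word that directly follows 'Exhibit' and looks like an exhibit ID."""
    pairs = zip(words, words[1:])  # (previous word, current word)
    # Strip surrounding punctuation such as ';' or '.' from the current word
    cleaned = ((prev, cur.strip(string.punctuation)) for prev, cur in pairs)
    return [f"Exhibit {cur}" for prev, cur in cleaned
            if prev == "Exhibit" and EXHIBIT_ID.fullmatch(cur)]


def exhibits_from_text(text):
    """Find the same items straight from the raw text, without splitting it first."""
    return [f"Exhibit {m}" for m in re.findall(r"Exhibit\s+([A-Za-z]+\d+)", text)]


# Shortened sample based on the text in the question
sample = ("you tried an Exhibit R40; decidedly 50 times larger than Exhibit C400. "
          "Exhibit P21 as well. 0.1 you have not found it again. see Exhibit MM42.")

print(exhibits_from_words(sample.split()))
print(exhibits_from_text(sample))
```

Both functions ignore bare numbers such as 50 or 0.1, because those are neither preceded by the word Exhibit nor made up of letters followed by digits.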

