
Extract text from a PDF with regex

I have a pdf that looks like this:

and I would like to extract the numbered items into a dictionary:

output = {'01': 'Agriculture and related service activities',
          '011': 'Growing crops, market gardening and horticulture'...}

Currently I am using tika to extract the text from the PDF, but I now need a regular expression to pull the numbered items out of the content. How do I do this?

import re
from tika import parser
raw = parser.from_file(path)
text = raw['content']
regex = ???
match = re.findall(regex, text, flags=re.DOTALL)

The text variable contains the text of the document. It looks like this:

u"\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nSTATISTICS SINGAPORE - Singapore Standard Industrial Classification, 2015\\n\\n\\nSection A: Agriculture and Fishing\\n\\nSSIC 2015 Industry SSIC 2010\\n\\nSECTION A AGRICULTURE AND FISHING\\n\\n01 AGRICULTURE AND RELATED SERVICE ACTIVITIES\\n\\n011 GROWING OF CROPS, MARKET GARDENING AND HORTICULTURE\\n\\n0111 Growing of Food Crops (Non-Hydroponics)\\n01111 Growing of leafy and fruit vegetables 01111\\n01112 Growing of mushrooms 01112\\n01113 Growing of root crops 01113......"

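A '^' at the front of the regex might not work: without re.MULTILINE, '^' matches only at the very start of the string, and this text starts with blank lines and a header rather than a numbered item. Try the code below.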

import re

# text is the string returned by tika in the question
regex = r'([\d]+).+?([a-zA-Z].+)'
match = re.findall(regex, text)
print(match)

Output : [('2015', 'Industry SSIC 2010'),
 ('01', 'AGRICULTURE AND RELATED SERVICE ACTIVITIES'),
 ('011', 'GROWING OF CROPS, MARKET GARDENING AND HORTICULTURE'),
 ('0111', 'Growing of Food Crops (Non-Hydroponics)'),
 ('01111', 'Growing of leafy and fruit vegetables 01111'),
 ('01112', 'Growing of mushrooms 01112'),
 ('01113', 'Growing of root crops 01113......')]
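The behaviour of '^' can be checked on a small string. This is only a sketch: `sample` and `pattern` are made-up names, and the sample is a shortened stand-in for the real tika output, assuming it contains real newlines.

```python
import re

# Shortened stand-in for the extracted text (hypothetical sample).
sample = "SECTION A AGRICULTURE AND FISHING\n\n01 AGRICULTURE\n011 GROWING OF CROPS\n"

pattern = r"^(\d+) (.+)"

# Without re.MULTILINE, ^ matches only at position 0, where the text starts with "S".
without_flag = re.findall(pattern, sample)

# With re.MULTILINE, ^ also matches right after every newline.
with_flag = re.findall(pattern, sample, flags=re.MULTILINE)

print(without_flag)  # []
print(with_flag)     # [('01', 'AGRICULTURE'), ('011', 'GROWING OF CROPS')]
```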

Hope it helps.

You can try the following:

regex = r'^([\d]+).+?([a-zA-Z].+?)(\d.+|$)'
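To go from a pattern like this to the dictionary asked for in the question, one possible sketch is below. It assumes the tika text contains real newlines, that each numbered item sits on its own line, and that a trailing number on a line is the repeated SSIC 2010 code rather than part of the title. The names `sample`, `ITEM_RE`, and `extract_items` are made up for illustration.

```python
import re

# Sample shaped like the tika output quoted in the question (real newlines assumed).
sample = (
    "STATISTICS SINGAPORE - Singapore Standard Industrial Classification, 2015\n\n\n"
    "Section A: Agriculture and Fishing\n\n"
    "SSIC 2015 Industry SSIC 2010\n\n"
    "SECTION A AGRICULTURE AND FISHING\n\n"
    "01 AGRICULTURE AND RELATED SERVICE ACTIVITIES\n\n"
    "011 GROWING OF CROPS, MARKET GARDENING AND HORTICULTURE\n\n"
    "0111 Growing of Food Crops (Non-Hydroponics)\n"
    "01111 Growing of leafy and fruit vegetables 01111\n"
    "01112 Growing of mushrooms 01112\n"
    "01113 Growing of root crops 01113\n"
)

# re.MULTILINE lets ^ and $ match at every line boundary.
# Group 1 is the leading code, group 2 the title; an optional trailing
# numeric code (assumed to be the SSIC 2010 column) is matched but not captured.
ITEM_RE = re.compile(r"^(\d+)[ \t]+(.+?)(?:[ \t]+\d+)?[ \t]*$", re.MULTILINE)


def extract_items(text):
    """Return {code: title} for every line that starts with a numeric code."""
    return {code: title for code, title in ITEM_RE.findall(text)}


output = extract_items(sample)
for code, title in output.items():
    print(code, "->", title)
```

Because the pattern requires digits at the start of a line, header lines such as "SSIC 2015 Industry SSIC 2010" are skipped. A title that genuinely ends in a number would lose that number under the assumption above, so the result is worth spot-checking against the PDF.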
