I have a PDF that looks like this:
and I would like to extract the numbered items into a dictionary:
output = {'01': 'Agriculture and related service activities',
'011': 'Growing crops, market gardening and horticulture'...}
Currently I am using tika to extract the text from the PDF, but I now need a regular expression to extract the numbered items from the content. How do I do this?
import re
from tika import parser
raw = parser.from_file(path)
text = raw['content']
regex = ???
match = re.findall(regex, text, flags=re.DOTALL)
The text variable contains the text of the document. It looks like this:
u"\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nSTATISTICS SINGAPORE - Singapore Standard Industrial Classification, 2015\\n\\n\\nSection A: Agriculture and Fishing\\n\\nSSIC 2015 Industry SSIC 2010\\n\\nSECTION A AGRICULTURE AND FISHING\\n\\n01 AGRICULTURE AND RELATED SERVICE ACTIVITIES\\n\\n011 GROWING OF CROPS, MARKET GARDENING AND HORTICULTURE\\n\\n0111 Growing of Food Crops (Non-Hydroponics)\\n01111 Growing of leafy and fruit vegetables 01111\\n01112 Growing of mushrooms 01112\\n01113 Growing of root crops 01113......"
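A leading '^' might not work here: without re.MULTILINE it only matches at the very start of the string, not at the start of each line. Try the code below.
import re
regex = r'(\d+).+?([a-zA-Z].+)'  # a number, a separator, then the rest of the line
match = re.findall(regex, text)
print(match)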
Output: [('2015', 'Industry SSIC 2010'),
('01', 'AGRICULTURE AND RELATED SERVICE ACTIVITIES'),
('011', 'GROWING OF CROPS, MARKET GARDENING AND HORTICULTURE'),
('0111', 'Growing of Food Crops (Non-Hydroponics)'),
('01111', 'Growing of leafy and fruit vegetables 01111'),
('01112', 'Growing of mushrooms 01112'),
('01113', 'Growing of root crops 01113......')]
Hope it helps.
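To get from the list of tuples to the dictionary asked for in the question, the matches can be passed straight to dict(). This is a minimal sketch: `sample` is a shortened stand-in for the tika output, and it assumes the text contains real newline characters. Note that the header row ('2015', 'Industry SSIC 2010') ends up in the dictionary too, and the trailing SSIC 2010 code stays attached to the five-digit titles.

```python
import re

# Shortened stand-in for raw['content'] (assumption: real newlines in the text).
sample = (
    "SSIC 2015 Industry SSIC 2010\n\n"
    "01 AGRICULTURE AND RELATED SERVICE ACTIVITIES\n\n"
    "011 GROWING OF CROPS, MARKET GARDENING AND HORTICULTURE\n\n"
    "0111 Growing of Food Crops (Non-Hydroponics)\n"
    "01111 Growing of leafy and fruit vegetables 01111\n"
)

# Same pattern as above: a number, a separator, then the rest of the line.
matches = re.findall(r'(\d+).+?([a-zA-Z].+)', sample)

# Each match is a (code, title) tuple, so dict() maps code -> title.
output = dict(matches)
print(output)
```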
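You can try the following (pass flags=re.MULTILINE so that '^' and '$' match at every line rather than only at the start and end of the whole string):
regex = r'^(\d+).+?([a-zA-Z].+?)(\d.+|$)'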
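As a rough sketch of how this anchored pattern could be used, the snippet below compiles it with re.MULTILINE and builds the dictionary from the first two groups. `sample` is a shortened stand-in for the tika output and assumes real newline characters. The third group only absorbs the trailing SSIC 2010 code and is discarded.

```python
import re

# Shortened stand-in for raw['content'] (assumption: real newlines in the text).
sample = (
    "SSIC 2015 Industry SSIC 2010\n\n"
    "01 AGRICULTURE AND RELATED SERVICE ACTIVITIES\n\n"
    "011 GROWING OF CROPS, MARKET GARDENING AND HORTICULTURE\n\n"
    "0111 Growing of Food Crops (Non-Hydroponics)\n"
    "01111 Growing of leafy and fruit vegetables 01111\n"
)

# re.MULTILINE lets ^ and $ match at the start and end of every line.
pattern = re.compile(r'^(\d+).+?([a-zA-Z].+?)(\d.+|$)', re.MULTILINE)

# Keep code -> title; strip the space left in front of the trailing code.
output = {code: title.strip() for code, title, _ in pattern.findall(sample)}
print(output)
```

Because '^' requires the line to start with digits, the header line "SSIC 2015 Industry SSIC 2010" is skipped here.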