
Extract a string between two other strings in Python

I am trying to extract the comments from an FDF file (a PDF comment file). In practice, this means extracting a string that sits between two other strings. I did the following:

  1. I open the fdf file with the following command:
import re
import os

os.chdir("currentworkingdirectory")
archcom = open("comentarios.fdf", "r")
cadena = archcom.read()
  2. From the opened file, I read everything into a string called cadena, which holds all the info I need. For example:
cadena = "\n215 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<</W 3.0>>\nendobj\n219 0 obj\n<</W 3.0>>\nendobj\ntrailer\n<</Root 1 0 R>>\n%%EOF\n"
  3. I try to extract the needed info with the following line:
a = re.findall(r"nendobj(.*?)W 3\.0",cadena)

Trying to get:

a = "n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<<"

But I got:

a = []

The problem is in the line a = re.findall(r"nendobj(.*?)W 3\.0",cadena) but I can't see where. I have tried many combinations with no success.

I appreciate any comment.

Regards

It seems to me that there are 2 problems:

a) You are looking for nendobj, but the n is actually part of the line break \n. So you will not get a leading n in the output either, because there is no n in the string at that position.

b) Since the text you're looking for spans several newlines, you need the re.DOTALL flag. Without it, the . in the pattern does not match newline characters.
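Both points can be checked directly against the sample string from the question. The following is only a small diagnostic sketch; the variable names has_literal_nendobj, without_dotall and with_dotall are made up for illustration.

```python
import re

# Sample string copied from the question.
cadena = (
    "\n215 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj"
    "\n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj"
    "\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj"
    "\n218 0 obj\n<</W 3.0>>\nendobj"
    "\n219 0 obj\n<</W 3.0>>\nendobj"
    "\ntrailer\n<</Root 1 0 R>>\n%%EOF\n"
)

# a) The character before "endobj" is a newline, not the letter n,
#    so the literal text "nendobj" never occurs in the string.
has_literal_nendobj = "nendobj" in cadena

# b) Without re.DOTALL, "." stops at newlines, so the lazy group can
#    never reach "W 3.0" on a later line.
without_dotall = re.findall(r"endobj(.*?)W 3\.0", cadena)

# With re.DOTALL, "." also matches newlines and the pattern can match.
with_dotall = re.findall(r"endobj(.*?)W 3\.0", cadena, re.DOTALL)

print(has_literal_nendobj)  # False
print(without_dotall)       # []
print(len(with_dotall))     # 2
```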

Final code:

a = re.findall(r"endobj(.*?)W 3\.0", cadena, re.DOTALL)

Also note that there will be a second result, as confirmed by Regex101.
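To see both results, and to keep only the first one if that is all you need, something like the following should work. This is a sketch against the sample string from the question; pattern, matches and first_text are illustrative names, and using re.search for the first hit is just one option.

```python
import re

# Sample string copied from the question.
cadena = (
    "\n215 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj"
    "\n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj"
    "\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj"
    "\n218 0 obj\n<</W 3.0>>\nendobj"
    "\n219 0 obj\n<</W 3.0>>\nendobj"
    "\ntrailer\n<</Root 1 0 R>>\n%%EOF\n"
)

# Compile once; DOTALL lets "." cross line breaks.
pattern = re.compile(r"endobj(.*?)W 3\.0", re.DOTALL)

# findall returns the captured group of every non-overlapping match.
matches = pattern.findall(cadena)

# search stops at the first match; guard against "no match" (None).
first = pattern.search(cadena)
first_text = first.group(1) if first else None

for m in matches:
    print(repr(m))
# '\n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<</'
# '\n219 0 obj\n<</'
```

Note that each captured piece ends with "<</" rather than "<<", because the slash directly before W 3.0 is part of the text between the two delimiters.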
