Extract a string between two other strings in Python

Question

I am trying to extract the comments from an FDF file (a PDF comment file). In practice, this means extracting the string that lies between two other strings. Here is what I did:

I open the FDF file and read its contents with the following code:

import re
import os

os.chdir("currentworkingdirectory")
# Read the whole FDF file into one string; the with block closes the file afterwards.
with open("comentarios.fdf", "r") as archcom:
    cadena = archcom.read()

After reading the file, I have a string called cadena with all the information I need. For example:

cadena = "\n215 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<</W 3.0>>\nendobj\n219 0 obj\n<</W 3.0>>\nendobj\ntrailer\n<</Root 1 0 R>>\n%%EOF\n"

I try to extract the needed info with the following line:

a = re.findall(r"nendobj(.*?)W 3\.0",cadena)

I expected to get:

a = "n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<<"

But I got:

a = []

The problem is in the line a = re.findall(r"nendobj(.*?)W 3\.0",cadena) but I can't see where. I have tried many combinations with no success.

I appreciate any comment.

Regards

Answer 1

It seems to me that there are 2 problems:

a) You are looking for nendobj, but the n is actually part of the line break \n. So you also won't get a leading n in the output, because there is no n.

b) Since the text you're looking for spans several newlines, you need the re.DOTALL flag, so that . also matches newline characters.
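Both points can be checked on a shortened version of the sample string. This is only a minimal sketch; the names sample, has_literal_n, without_dotall and with_dotall are made up for illustration and do not appear in the question:

```python
import re

# Shortened version of the string from the question (hypothetical name `sample`).
sample = "<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<</W 3.0>>\nendobj\n"

# a) "\n" is a single newline character, so the literal text "nendobj" never occurs.
has_literal_n = "nendobj" in sample

# b) Without re.DOTALL, "." does not match newlines, so the match cannot span lines.
without_dotall = re.findall(r"endobj(.*?)W 3\.0", sample)
with_dotall = re.findall(r"endobj(.*?)W 3\.0", sample, re.DOTALL)

print(has_literal_n)    # False
print(without_dotall)   # []
print(with_dotall)      # ['\n218 0 obj\n<</']
```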

Final code:

a = re.findall(r"endobj(.*?)W 3\.0", cadena, re.DOTALL)

Also note that there will be a second result, as confirmed by Regex101.
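To see the second result, the pattern can be run against the full sample string from the question. The following is a sketch rather than a definitive solution; the names matches and first are assumptions, and using re.search to keep only the first match is one possible choice if the second result is not wanted:

```python
import re

# Sample string from the question.
cadena = (
    "\n215 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj"
    "\n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj"
    "\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj"
    "\n218 0 obj\n<</W 3.0>>\nendobj"
    "\n219 0 obj\n<</W 3.0>>\nendobj"
    "\ntrailer\n<</Root 1 0 R>>\n%%EOF\n"
)

pattern = r"endobj(.*?)W 3\.0"

# findall returns every non-overlapping match, so two strings come back here.
matches = re.findall(pattern, cadena, re.DOTALL)
print(len(matches))      # 2
print(repr(matches[1]))  # '\n219 0 obj\n<</'

# re.search stops at the first match; group(1) is the captured text.
found = re.search(pattern, cadena, re.DOTALL)
first = found.group(1) if found else None
print(first == matches[0])  # True
```

Note that the captured text ends with <</ rather than <<, because the / directly before W 3.0 is part of what lies between the two delimiters.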

solution1
0 2020-10-18 19:32:40
