Extract text between two patterns in Python using regex

Question

I am trying to extract every text together with its category (i.e. A, B, C) from input like this:

A     <some text1> 

B     <some text2> 

C     <some text3>

However, when I apply this regex -

ptrn='\n[A-z]*\t'     

pattern1= '(.*)'+ptrn      

f = re.findall(pattern1,test_doc)

it gives me

f[0] = A     <some text1> 

f[1] = <some text2> 

f[2] = <some text3>

But I want -

f[0] = A     <some text1>

f[1] = B     <some text2>

f[2] = C     <some text3>

This link has the raw text of many documents. Each document has the following pattern:

category<tab><sometext> \n

Hence the whole corpus looks like this:

category<tab><sometext1> \n 

category<tab><sometext2> \n

.

.

I want

doc[0] = category<tab><sometext1>

doc[1] = category<tab><sometext2>

.
.
and so on

Any answer/hint will be very helpful :)

Answer 1

Try the following pattern:

import re
pattern = r"(\w+)(\t)(.*)(\b)"
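
A minimal sketch of how this pattern might be applied, assuming the corpus is held in a string named test_doc as in the question. The sample text below is made up for illustration. Because the pattern has four capture groups, re.findall would return a 4-tuple per match, so this sketch uses re.finditer and takes group(0) to get each document as one string.

```python
import re

# Hypothetical sample corpus: one "category<TAB>text" document per line.
test_doc = "A\tsome text1 \nB\tsome text2 \nC\tsome text3"

pattern = r"(\w+)(\t)(.*)(\b)"

# group(0) is the whole match: category, tab, and the text after it.
docs = [m.group(0) for m in re.finditer(pattern, test_doc)]

for i, doc in enumerate(docs):
    print(i, repr(doc))
```

Note that the trailing \b makes each match end at the last word boundary on its line, so trailing spaces (and any trailing punctuation) are not included in the result. If the full line is needed, a pattern such as r"^\w+\t.*$" with re.MULTILINE may be a closer fit.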

Explanation