
Extract text between two patterns in Python using regex

I am trying to extract every document's text, including its category label (i.e. A, B, C).

A     <some text1> 

B     <some text2> 

C     <some text3> 

However, when I apply this regex -

ptrn='\n[A-z]*\t'     

pattern1= '(.*)'+ptrn      

f = re.findall(pattern1,test_doc)      

it gives me

f[0] = A     <some text1> 

f[1] = <some text2> 

f[2] = <some text3> 

But I want -

f[0] =  A     <some text1>

f[1] =  B     <some text2> 

f[2] =  C     <some text3> 

http://csmining.org/tl_files/Project_Datasets/r8%20r52/r8-test-all-terms.txt

This link has the raw text of many documents. Each document has the following pattern:

category<tab><sometext> \n 

Hence the whole corpus looks like this:

category<tab><sometext1> \n 

category<tab><sometext2> \n

.

.

I want

doc[0] = category<tab><sometext1>

doc[1] = category<tab><sometext2>

.
.
and so on

Any answer/hint will be very helpful :)

Try the following pattern:

import re
pattern = r"(\w+)(\t)(.*)(\b)"

Explanation

  • (\w+) matches one or more word characters (the category)
  • (\t) matches the tab character literally
  • (.*) matches any characters except line terminators, as many as possible (the text)
  • (\b) asserts a word boundary at the end of the match
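
As a minimal sketch of how this pattern might be applied: the sample string below is a hypothetical stand-in for the corpus, and the names `docs` and `pairs` are illustrative only. Because the pattern contains capturing groups, `re.findall` would return one tuple per match, so the sketch uses `re.finditer` and `group(0)` to get each whole document.

```python
import re

# Hypothetical sample standing in for the corpus:
# category<tab>text, one document per line.
test_doc = "A\tsome text1\nB\tsome text2\nC\tsome text3\n"

pattern = r"(\w+)(\t)(.*)(\b)"

# group(0) is the full match: category, tab and text together.
docs = [m.group(0) for m in re.finditer(pattern, test_doc)]

# Groups 1 and 3 give the category and the text separately.
pairs = [(m.group(1), m.group(3)) for m in re.finditer(pattern, test_doc)]

print(docs)
print(pairs)
```

Note that the trailing `\b` makes `(.*)` back off to the last word boundary on the line, so trailing spaces or punctuation would not be part of the match.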

See a demo on regex101
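
The attempt in the question loses the category of every document after the first because `'\n[A-z]*\t'` consumes the next category as part of the separator. One possible alternative, sketched here under the assumption that each document sits on its own line and that a category never contains a tab, is to anchor the match to line starts and ends with `re.MULTILINE`. The sample string and the name `doc` are hypothetical.

```python
import re

# Hypothetical sample standing in for the corpus.
test_doc = "A\tsome text1\nB\tsome text2\nC\tsome text3\n"

# Under re.MULTILINE, ^ and $ match at the start and end of every line,
# so nothing between documents is consumed. The pattern has no capturing
# groups, so findall returns the full matched lines.
# Assumption: the category is everything before the first tab.
doc = re.findall(r"^[^\t\n]+\t.*$", test_doc, flags=re.MULTILINE)

print(doc[0])
print(doc[1])
```

If the file really is strictly one document per line, `test_doc.splitlines()` would give a similar list without any regex.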
