I am trying to extract every line of text, including its category (i.e. A, B, C).
A <some text1>
B <some text2>
C <some text3>
However, when I apply this regex -
ptrn='\n[A-z]*\t'
pattern1= '(.*)'+ptrn
f = re.findall(pattern1,test_doc)
it gives me
f[0] = A <some text1>
f[1] = <some text2>
f[2] = <some text3>
But I want -
f[0] = A <some text1>
f[1] = B <some text2>
f[2] = C <some text3>
http://csmining.org/tl_files/Project_Datasets/r8%20r52/r8-test-all-terms.txt
This link has the raw text of many documents. Each document has the following pattern:
category<tab><sometext> \n
Hence the whole corpus looks like this:
category<tab><sometext1> \n
category<tab><sometext2> \n
.
.
I want
doc[0] = category<tab><sometext1>
doc[1] = category<tab><sometext2>
.
.
and so on
Any answer/hint will be very helpful :)
Try the following pattern:
import re
pattern = r"(\w+)(\t)(.*)(\b)"
Explanation
(\w+) matches any word character, one or more times
(\t) matches the tab character literally
(.*) matches any character except line terminators, zero or more times
(\b) is a word boundary
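The attempt in the question loses B and C because the trailing '\n[A-z]*\t' consumes the next line's category as part of the previous match, so the next search starts after it. As a minimal sketch of how the pattern above could be applied instead (test_doc here is a made-up sample standing in for the real corpus, and docs is just an illustrative name), re.finditer with group(0) returns each whole line, whereas re.findall would return a tuple of the four groups per match:

```python
import re

# Hypothetical sample standing in for the corpus:
# one "category<tab>text" entry per line.
test_doc = "A\tsome text1\nB\tsome text2\nC\tsome text3\n"

pattern = r"(\w+)(\t)(.*)(\b)"

# group(0) is the full match, i.e. category, tab and text together.
docs = [m.group(0) for m in re.finditer(pattern, test_doc)]

for i, doc in enumerate(docs):
    print(f"doc[{i}] = {doc!r}")
```

Note that \w does not match a hyphen, so if a category name contains one, only the part after the hyphen would be captured; a character class such as [\w-]+ may be needed in that case.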