
How to match a sentence with multiple dots using regex in python?

I am looking for questions and answers in a .txt file using re in Python. Here is a sample of the text file:

1. Xiva xonligining yirik shaharlari xato berilgan javobni aniqlang.
A) Xiva, Kat        
B) Yangi Urganch, Hazorasp    
C) Qo'ng'irot, Xo'jayli   
D) Vazir, Mang'it
2. Xiva xonligi Buxoro amirligi kabi bekliklarga bo'lingan bo'lib, ularni xon tomonidan tayinlangan ......... boshqargan.
A) beklar         
B) noiblar     
C) beklar va to'ralar      
D) biy va beklar

Don't mind the language. Each question starts with a number followed by a dot (.), then comes the question body (usually ending with ?, . or !).

Then come the answers, A to D, each followed by a closing parenthesis ).

Here is my regex for finding questions: r"^(\d+\.)?\s+[\"']?([.]{2,})?[A-Z][^.?!]+((?![.?!]['\"]?\s[\"']?[A-Z][^.?!]).)+[.?!'\"]+$"

My problem is that when there are multiple dots inside the question body, as in question #2, my regex cannot match the whole question body. Instead, it stops at the first dot it sees. How should I go about this? Any help would be appreciated. Thanks.

By the way, here is how I am finding the answers with regex: r"^[a-zA-Z]\)?\s+\w+.+" Suggestions regarding my approach to finding questions and answers are also welcome.

Since your text file is properly formatted, you can try this for extracting questions:

re.findall(r"[\d+][\s\S][\w\W]+[\d\D]", t)

You can start by matching the digits, a dot and the rest of the line. Then optionally repeat all lines that do not start with an uppercase char A-Z followed by )

If you only want the question body, you can capture that in a group.

^\d+\.[^\S\r\n]*([A-Z].*(?:\r?\n(?![A-Z]\)).*)*[.?!])

Explanation

  • ^ Start of string (or start of each line when the re.MULTILINE flag is used)
  • \d+\.[^\S\r\n]* Match 1+ digits, a . and optional whitespace chars without a newline
  • ( Capture group 1
    • [A-Z].* Match an uppercase char A-Z followed by the rest of the line
    • (?:\r?\n(?![A-Z]\)).*)* Optionally repeat all lines that do not start with an uppercase char A-Z followed by )
    • [.?!] Match either . ? or !
  • ) Close group 1
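
As a minimal sketch, the pattern above could be applied to the sample from the question like this. It assumes the re.MULTILINE flag, so that ^ matches at the start of every line rather than only at the start of the string. The names text and QUESTION_RE are illustrative and not from the original post.

```python
import re

# Sample taken from the question's text file (trailing spaces trimmed).
text = """1. Xiva xonligining yirik shaharlari xato berilgan javobni aniqlang.
A) Xiva, Kat
B) Yangi Urganch, Hazorasp
C) Qo'ng'irot, Xo'jayli
D) Vazir, Mang'it
2. Xiva xonligi Buxoro amirligi kabi bekliklarga bo'lingan bo'lib, ularni xon tomonidan tayinlangan ......... boshqargan.
A) beklar
B) noiblar
C) beklar va to'ralar
D) biy va beklar
"""

# Assumption: re.MULTILINE is needed so ^ anchors at every line start.
QUESTION_RE = re.compile(
    r"^\d+\.[^\S\r\n]*([A-Z].*(?:\r?\n(?![A-Z]\)).*)*[.?!])",
    re.MULTILINE,
)

# With a single capture group, findall returns the group 1 values,
# i.e. the question bodies without the leading number.
questions = QUESTION_RE.findall(text)

for body in questions:
    print(body)
```

Because .* is greedy and the final [.?!] only has to match the last punctuation mark, the run of dots inside question 2 stays part of the captured body.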


The pattern to match an answer ^[a-zA-Z]\)?\s+\w+.+ has an optional ) and can start with a lowercase char a-z as well, which would, for example, also match the line "a test".

If the ) is always there, you don't have to make it optional, and matching just [A-Z] will make the chance of a false positive a bit smaller.
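
To illustrate the difference, here is a small sketch comparing the original answer pattern with a stricter variant. The stricter pattern ^[A-Z]\)\s+\w+.+ is only one possible reading of the suggestion above, and the names LOOSE_ANSWER_RE, STRICT_ANSWER_RE and lines are made up for this example.

```python
import re

# Original pattern from the question: optional ")" and any letter case.
LOOSE_ANSWER_RE = re.compile(r"^[a-zA-Z]\)?\s+\w+.+", re.MULTILINE)

# Assumed stricter variant: mandatory ")" and uppercase letters only.
STRICT_ANSWER_RE = re.compile(r"^[A-Z]\)\s+\w+.+", re.MULTILINE)

# One ordinary line that is not an answer, followed by two real answers.
lines = "a test\nA) beklar\nB) noiblar"

loose = LOOSE_ANSWER_RE.findall(lines)
strict = STRICT_ANSWER_RE.findall(lines)

print(loose)   # includes the false positive "a test"
print(strict)  # only the two answer lines
```

If some files use lowercase answer letters, the character class would need to be widened again, so this trade-off depends on how consistent the input is.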
