
Regular expression for converting RTF to text in Python

I am going to use this regular expression on my RTF file:

((?:^|\s)[^\s\\]+(?:\\(?!line)[A-Za-z]+\n?(?:-?\d+)?[ ]?)+)(\b[^\s\\])

As you can see when testing it on https://regexr.com/, xxx\par\fi-240\li720 is not matched completely because it is followed by "-->" in my RTF file. The regex only matches " xxx\par\fi-".

Do you have any idea how to solve it?

This is my rtf file:

{\rtf1\ansi\ansicpg1252\cocoartf2513
\cocoatextscaling0\cocoaplatform0{\fonttbl\f0\froman\fcharset0 Times-Roman;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\paperw15000\paperh15840\margl1440\margt1440\margr1440\margb1440\deftab1134\widowctrl\lytexcttp\formshade\headery720\footery720\pgwsxn15000\pghsxn15840\marglsxn1440\margtsxn1440\margrsxn1440\margbsxn1440\pgbrdropt32\pard\pard\fi-240\li720\tx1200\tx1920\tx2640\tx3360\tx4080\tx4800\tx5520\tx6240\tx6960\tx7680\tx8400\tx9120\tx9840\tx10560\itap0\nowidctlpar\plain\f2\fs20\b\chshdng0\chcfpat0{XX, XX   XX\plain\f2\fs20\chshdng0\chcfpat0\par\fi-240\li720\tx1200\tx1920\tx2640\tx3360\tx4080\tx4800\tx5520\tx6240\tx6960\tx7680\tx8400\tx9120\tx9840\tx10560 URN: xxx  DOB: xx  Sex: XX\par\fi-240\li720\tx1200\tx1920\tx2640\tx3360\tx4080\tx4800\tx5520\tx6240\tx6960\tx7680\tx8400\tx9120\tx9840\tx10560 Home address: 3 xxx xx, xxxxx 3134\par\pard\fi-240\li720\pard\pard\fi-240\li720\itap0\nowidctlpar Home Phone:   Mobile Phone:}
xxxx\par\fi-240\li720 swab xxx\par\fi-240\li720 to d/w xxxx\par\fi-240\li720 -->case x/  XX\par\fi-240\li720 to x/x xxx}

The last group of the current pattern, (\b[^\s\\]), starts with a word boundary and then expects a single character that is neither whitespace nor a backslash.

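In the example data, the next character after the whitespace is a - , and there is no word boundary between a whitespace character and - because neither is a word character. The regex engine therefore backtracks to the nearest position where the last group can match, which is between the i of \fi and the - that follows it, and that gives the truncated match.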
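As a small sketch of that word-boundary behaviour, the checks below use Python's re module on short fragments taken from the sample data. The variable names are only illustrative.

```python
import re

# \b only matches between a word character (letter, digit, underscore)
# and a non-word character (or the start/end of the string).

# Space followed by "-": both are non-word characters, so no boundary.
boundary_before_dash = re.search(r" \b-", " -->case") is not None

# Space followed by "s": non-word then word character, so a boundary exists.
boundary_before_letter = re.search(r" \bs", " swab") is not None

# "i" followed by "-" inside \fi-240: word then non-word character,
# which is where the original pattern ends up stopping.
boundary_after_fi = re.search(r"i\b-", r"\fi-240") is not None

print(boundary_before_dash, boundary_before_letter, boundary_after_fi)
```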

One option is to use an alternation in the last group that also accepts a literal - after the control words: (\b[^\s\\]|-)

The pattern would then look like

((?:^|\s)[^\s\\]+(?:\\(?!line)[A-Za-z]+\n?(?:-?\d+)?[ ]?)+)(\b[^\s\\]|-)
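To compare the two patterns in Python, a minimal sketch could run both against the last line of the RTF sample from the question. The names ORIGINAL, FIXED, sample and full_matches are made up for this illustration.

```python
import re

# Last line of the RTF sample from the question, written as a raw string
# so that the backslashes stay literal.
sample = (
    r"xxxx\par\fi-240\li720 swab xxx\par\fi-240\li720 to d/w "
    r"xxxx\par\fi-240\li720 -->case x/  XX\par\fi-240\li720 to x/x xxx}"
)

# Pattern from the question: the last group requires a word boundary.
ORIGINAL = re.compile(
    r"((?:^|\s)[^\s\\]+(?:\\(?!line)[A-Za-z]+\n?(?:-?\d+)?[ ]?)+)(\b[^\s\\])"
)

# Same pattern, with the last group also accepting a literal "-".
FIXED = re.compile(
    r"((?:^|\s)[^\s\\]+(?:\\(?!line)[A-Za-z]+\n?(?:-?\d+)?[ ]?)+)(\b[^\s\\]|-)"
)


def full_matches(pattern, text):
    """Return the full text of every non-overlapping match."""
    return [m.group(0) for m in pattern.finditer(text)]


original_matches = full_matches(ORIGINAL, sample)
fixed_matches = full_matches(FIXED, sample)

for before, after in zip(original_matches, fixed_matches):
    print(repr(before), "->", repr(after))
```

On this sample, only the match in front of "-->" differs: the original pattern stops at " xxxx\par\fi-", while the adjusted one continues through \li720 and captures the - in the last group.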

