
Regex to match until a specific string or the end of the string is met

I'm trying to create the right regex to use in Python for a multi-line match in the following scenario. After matching the string Description\s: I need to skip the rest of that line, then capture all the text before the first occurrence of a line holding only a space and a dot (\s\.\n), OR the string Homepage:, OR the end of the string.

I'm trying the following regex, but something is missing and not all of the scenarios are covered: Description\s*:\s*.*\n(?P<description>[\w\s\$\&\+\,\:\;\=\?\@\#\|\'\<\>\.\^\*\(\)\%\.\-]*\n\s*)\s\.

Scenario 1: Expected result: "libX11-xcb provides functions needed by clients which take advantage of Xlib/XCB to mix calls to both Xlib and XCB over the same X connection."

Pre-Depends: multiarch-support
Description: Xlib/XCB interface library
 libX11-xcb provides functions needed by clients which take advantage of
 Xlib/XCB to mix calls to both Xlib and XCB over the same X connection.
 .
 More information about X.Org can be found at:
 <URL:http://www.X.org>
 .
 More information about XCB can be found at:
 <URL:http://xcb.freedesktop.org>
 .
 This module can be found at
 git://anongit.freedesktop.org/git/xorg/lib/libX11

Scenario 2: Expected result: "This package contains a number of important utilities, most of which are oriented towards maintenance of your system. Some of the more important utilities included in this package allow you to partition your hard disk, view kernel messages, and create new filesystems."

Essential: yes
Installed-Size: 2999
Replaces: bash-completion (<< 1:2.1-4.1~), initscripts (<< 2.88dsf-59.2~), mount (= 2.26.2-3), mount (= 2.26.2-3ubuntu1), sysvinit-utils (<< 2.88dsf-59.1~)
Pre-Depends: libblkid1 (>= 2.25), libc6 (>= 2.15), libfdisk1 (>= 2.29~rc2), libmount1 (>= 2.25), libncursesw5 (>= 6), libpam0g (>= 0.99.7.1), libselinux1 (>= 2.6-3~), libsmartcols1 (>= 2.28~rc1), libsystemd0, libtinfo5 (>= 6), libudev1 (>= 183), libuuid1 (>= 2.16), zlib1g (>= 1:1.1.4)
Conffiles:
 /etc/default/hwclock 3916544450533eca69131f894db0ca12
Description: miscellaneous system utilities
 This package contains a number of important utilities, most of which
 are oriented towards maintenance of your system. Some of the more
 important utilities included in this package allow you to partition
 your hard disk, view kernel messages, and create new filesystems.

Scenario 3: Expected result: "libcurl is an easy-to-use client-side URL transfer library, supporting DICT, FILE, FTP, FTPS, GOPHER, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, POP3, POP3S, RTMP, RTSP, SCP, SFTP, SMTP, SMTPS, TELNET and TFTP."

Architecture: blob
Multi-Arch: same
Recommends: ca-certificates
Description: easy-to-use client-side URL transfer library (OpenSSL flavour)
 libcurl is an easy-to-use client-side URL transfer library, supporting DICT,
 FILE, FTP, FTPS, GOPHER, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, POP3, POP3S,
 RTMP, RTSP, SCP, SFTP, SMTP, SMTPS, TELNET and TFTP.
 .
 libcurl supports SSL certificates, HTTP POST, HTTP PUT, FTP uploading, HTTP
 form based upload, proxies, cookies, user+password authentication (Basic,
 Digest, NTLM, Negotiate, Kerberos), file transfer resume, http proxy tunneling
 and more!
 .
 libcurl is free, thread-safe, IPv6 compatible, feature rich, well supported,
 fast, thoroughly documented and is already used by many known, big and
 successful companies and numerous applications.
 .
 SSL support is provided by OpenSSL.
Homepage: http://curl.haxx.se

I would appreciate any help in getting the right expression.

This should work.

import re

# str1 is assumed to hold one package stanza, as in the scenarios above
match = re.search(r'Description:.*?\n(.*?)(?:\n\s\.\n|\nHomepage:|$)', str1, re.DOTALL)
if match:
    print(match.group(1))
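A terminator written as \s.\n, with the dot unescaped, matches a whitespace character followed by any character and a newline, so it can cut the capture short. The sketch below compares the unescaped and escaped forms on a made-up stanza (sample is hypothetical, not taken from the question):

```python
import re

# Made-up stanza: the first body line happens to end in " a".
sample = (
    "Description: demo\n"
    " Choose plan a\n"
    " or plan b.\n"
    " .\n"
    " Extra paragraph\n"
)

# Unescaped dot: "\s.\n" also matches " a\n", so the capture stops early.
unescaped = re.search(r'Description:.*?\n(.*?)(\s.\n|$)', sample, re.DOTALL)

# Escaped dot: only a literal " ." followed by a newline ends the capture.
escaped = re.search(r'Description:.*?\n(.*?)(\s\.\n|$)', sample, re.DOTALL)

print(repr(unescaped.group(1)))
print(repr(escaped.group(1)))
```

With the unescaped dot the capture ends in the middle of the first body line. With the escaped dot it runs up to the " ." separator line.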

As an alternative, you could get the match without re.DOTALL by consuming the text line by line. A negative lookahead at the start of each line stops the match when the line starts with a space and a dot, when it starts with Homepage, or when the end of the string is reached. This avoids the unnecessary backtracking that a lazy .*? causes.

Note that the dot must be escaped as \. to match it literally.

\bDescription:.*\r?\n(?P<description>(?:(?! \.|$|Homepage).*(?:\r?\n)?)*)

In parts:

  • \bDescription:.*\r?\n Match Description: and the rest of the line and a newline
  • (?P<description> Named group description
    • (?: Non capture group
      • (?! \.|$|Homepage) Assert what is directly to the right is not one of the alternatives
      • .*(?:\r?\n)? Match any char except a newline 0+ times and match optional newline
    • )* Close non capture group and repeat 0+ times
  • ) Close the named group description
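The breakdown above can be tried out with a short script. This is only a sketch: the helper name extract_description and the three shortened stanzas are made up for illustration, and the whitespace collapsing is an added step that turns the wrapped lines into the single-line form shown in the expected results.

```python
import re

# Pattern from the answer above, compiled without re.DOTALL or re.MULTILINE.
DESCRIPTION_RE = re.compile(
    r'\bDescription:.*\r?\n'
    r'(?P<description>(?:(?! \.|$|Homepage).*(?:\r?\n)?)*)'
)


def extract_description(stanza):
    """Return the first description paragraph as one line, or None."""
    match = DESCRIPTION_RE.search(stanza)
    if match is None:
        return None
    # Collapse the wrapped, space-indented lines into a single line.
    return " ".join(match.group("description").split())


# Shortened stanzas modelled on the three scenarios in the question.
ends_with_dot_line = (
    "Pre-Depends: multiarch-support\n"
    "Description: Xlib/XCB interface library\n"
    " libX11-xcb provides functions needed by clients which take advantage of\n"
    " Xlib/XCB to mix calls to both Xlib and XCB over the same X connection.\n"
    " .\n"
    " More information about X.Org can be found at:\n"
)
ends_with_string_end = (
    "Description: miscellaneous system utilities\n"
    " This package contains a number of important utilities, most of which\n"
    " are oriented towards maintenance of your system.\n"
)
ends_with_homepage = (
    "Description: easy-to-use client-side URL transfer library\n"
    " SSL support is provided by OpenSSL.\n"
    "Homepage: http://curl.haxx.se\n"
)

for stanza in (ends_with_dot_line, ends_with_string_end, ends_with_homepage):
    print(extract_description(stanza))
```

Each call prints only the first paragraph of the description, whether it is ended by a " ." line, by the end of the string, or by the Homepage field.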

