
Using regular expressions to find a pattern

If I have a file that consists of sentences like this:

1001 apple
1003 banana
1004 grapes
1005 
1007 orange

Now I want to detect and print all lines that have a number but no corresponding text (e.g. 1005). How can I design a regular expression to find such lines? I find them a bit confusing to construct.

res = []
with open("fruits.txt", "r") as f:
    for fruit in f:
        res.append(fruit.strip().split())

Would it be something like this: re.sub("10**"/.")

Well, you don't need a regular expression for this:

with open("fruits.txt", "r") as f:
    res = [int(line.strip()) for line in f if len(line.split()) == 1]

A regex that would detect a number, then a space, then a word is ([0-9])+[ ]\w+ (here \w matches letters, digits, and the underscore).

A good resource for trying this stuff out is http://regexr.com/

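The re pattern for this would be "^[0-9][0-9][0-9][0-9]$", checked against each stripped line with re.match (re.sub is for substitution and also needs a replacement and an input string). The ^ and $ anchors make it match only when the line holds four digits and nothing else, so it will find your 1005 but not 1001 apple.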
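A minimal sketch of the "four digits and nothing else" check, applied line by line. It uses re.fullmatch so the whole line has to match, and the hard-coded list stands in for the file; both choices are assumptions made for illustration rather than part of the answer above.

```python
import re

# Sample lines standing in for fruits.txt (copied from the question).
lines = ["1001 apple", "1003 banana", "1004 grapes", "1005 ", "1007 orange"]

# Four digits; fullmatch requires the pattern to cover the entire string.
only_number = re.compile(r"[0-9]{4}")

# strip() removes a trailing space or newline before the check.
unlabeled = [line.strip() for line in lines if only_number.fullmatch(line.strip())]
print(unlabeled)  # ['1005']
```

With a real file, iterating over the open file object instead of the list should behave the same way.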

Hope this helps!

There are two ways to go about this: search() and findall(). The former finds the first match, and the latter returns a list of every match.

In any case, the regex you want to use is "^\d{4}$". It's a simple regex which matches a 4-digit number that takes up the entirety of a string or, in multiline mode, of a line. So, to find 'only number' lines, you can use the following code:

import re

# set 'func' to either re.search or re.findall, whichever you prefer
func = re.findall
with open("fruits.txt", "r") as f:
    # raw string so that \d reaches the regex engine unchanged
    solo = func(r"^\d{4}$", f.read(), re.MULTILINE)
# 'solo' now holds either a match object for the first 'non-labeled'
# number (call solo.group() to get the text), or a list of all such
# numbers in the file, depending on the function you used.
# search() returns None if there are no such numbers, and findall()
# returns an empty list.
# if you prefer brevity, re.MULTILINE is equivalent to re.M

Additional explanation of the regex:
^ matches at the beginning of the line.
\d is a special sequence which matches any numeric digit.
{4} matches the prior element ( \d ) exactly four times.
$ matches at the end of the line.
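As a quick check of the anchors in multiline mode, the sketch below runs the pattern over an inline string instead of a file. The string mirrors the sample in the question, where the 1005 line appears to carry a trailing space; the second, more tolerant pattern is an assumption added here to cope with that, not part of the answer above.

```python
import re

# Inline text standing in for the contents of fruits.txt.
# Note the space after 1005, as in the question's sample.
text = "1001 apple\n1003 banana\n1004 grapes\n1005 \n1007 orange\n"

# Strict version: the line must be exactly four digits.
strict = re.findall(r"^\d{4}$", text, re.MULTILINE)

# Tolerant version (assumption): allow trailing spaces or tabs
# and capture only the digits.
tolerant = re.findall(r"^(\d{4})[ \t]*$", text, re.MULTILINE)

print(strict)    # []
print(tolerant)  # ['1005']
```

If the file has no trailing whitespace on such lines, the strict pattern should be enough.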

Please try:

(?:^|\s+)(\d{4}\b)(?!\s.*\w+)
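One way to try this pattern is line by line, as sketched below. The sample lines are an assumption standing in for the file. Applying the pattern per line seems the safer choice: on the whole file at once, the \s in the lookahead can match a newline, so a bare number followed by a labeled line would be skipped.

```python
import re

pattern = re.compile(r"(?:^|\s+)(\d{4}\b)(?!\s.*\w+)")

# Sample lines standing in for fruits.txt, newlines included.
lines = ["1001 apple\n", "1003 banana\n", "1004 grapes\n", "1005 \n", "1007 orange\n"]

found = []
for line in lines:
    match = pattern.search(line)
    if match:
        # group(1) is the four-digit number without surrounding whitespace
        found.append(match.group(1))
print(found)  # ['1005']

# Whole-text use: the lookahead sees "\n1007 orange" and rejects 1005.
whole = pattern.findall("1005\n1007 orange\n")
print(whole)  # []
```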

