
Regular expression for simple patterns

Problem

I have an image dataset in which each image shows a particular activity. Each image is named <activity>_<num>, for example educating_13.jpg, practicing_147.jpg, etc.

Now I want to select the images with the same activity, say "cooking", and I decided to do this with the re module in Python. The script I wrote looks like this:

import os
import re

pattern = r"^(\w+)_(\d+)$"
for filename in os.listdir("."):
    root, _ = os.path.splitext(filename)
    activity = re.match(pattern, root).group(1)
    if activity == "cooking":
        pass  # do something

However, even though many images are processed successfully, the script eventually aborts with an AttributeError. It seems that some of the images could not be matched by the specified pattern.

Did I make a mistake somewhere? Any input is appreciated.

EDIT:

By catching the exception, I found that among almost 150 thousand images there is a text file called temp.txt, and this is the one that violates the pattern.
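One way to track down such an entry, roughly along the lines described in the edit, is to catch the AttributeError and record the name instead of aborting. This is only a minimal sketch: the helper name find_unmatched and the sample file names are made up for illustration.

```python
import os
import re

PATTERN = r"^(\w+)_(\d+)$"

def find_unmatched(filenames):
    """Return the names whose root does not fit <activity>_<num>."""
    unmatched = []
    for filename in filenames:
        root, _ = os.path.splitext(filename)
        try:
            re.match(PATTERN, root).group(1)
        except AttributeError:
            # re.match returned None, which has no .group method
            unmatched.append(filename)
    return unmatched

# Hypothetical sample; on a real directory, pass os.listdir(".") instead.
print(find_unmatched(["cooking_12.jpg", "temp.txt", "educating_13.jpg"]))
```

With the sample list above, only temp.txt is reported.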

You can do this without regex, using str.split.

Ex:

import os

for filename in os.listdir("."):
    root, _ = os.path.splitext(filename)
    if "_" in root:
        # rsplit with maxsplit=1 tolerates underscores inside the activity name
        activity, num = root.rsplit("_", 1)
        if activity == "cooking":
            pass  # do something

re.match(pattern, root) returns None when the string does not match.

  1. Check whether re.match(pattern, root) is None to find the file that does not match.
  2. Use https://regex101.com/ to test your regexp against the names of your images.
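The None check from the first point could look roughly like this. It is a sketch, not a definitive implementation: the function name activity_of and the sample names are assumptions introduced here.

```python
import os
import re

PATTERN = re.compile(r"^(\w+)_(\d+)$")

def activity_of(filename):
    """Return the activity part of the name, or None if it does not match."""
    root, _ = os.path.splitext(filename)
    match = PATTERN.match(root)
    if match is None:
        # Not an <activity>_<num> name, e.g. a stray text file
        return None
    return match.group(1)

for name in ["cooking_7.jpg", "temp.txt"]:
    print(name, "->", activity_of(name))
```

Returning None for non-matching names lets the caller decide whether to skip the file or log it, rather than crashing halfway through a large directory.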

If re.match(pattern, root) is None, then calling .group(1) on it raises the AttributeError. So at least one entry in your directory does not match the pattern.

It's hard to know which ones are giving you problems, but \w matches only letters, digits and the underscore (just [a-zA-Z0-9_] in Python 2 or with re.ASCII; in Python 3, str patterns also accept Unicode letters by default), so:

  • Do any files contain punctuation characters (eg %)?
  • Do any files contain non-ASCII characters (eg ñ)?
  • Are there non-dataset related files in the directory as well?

You could post the directory listing, then maybe we can spot the file.
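The questions above can be tried out directly. The sketch below assumes Python 3, where \w in a str pattern also matches Unicode letters unless re.ASCII is passed; the sample names are invented for illustration.

```python
import re

pattern = r"^(\w+)_(\d+)$"

# Punctuation is never a word character, so this does not match.
print(re.match(pattern, "cook%ing_1"))           # None

# In Python 3, \w in a str pattern is Unicode-aware by default.
print(bool(re.match(pattern, "piñata_2")))       # True

# With re.ASCII, \w is limited to [a-zA-Z0-9_].
print(re.match(pattern, "piñata_2", re.ASCII))   # None

# A stray file with no _<num> suffix does not match either.
print(re.match(pattern, "temp"))                 # None
```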
