
Using Regex to extract file name from URL -- need to exclude some characters

I have a resource formatted as follows:

{"url": "http://res1.icourses.cn/share/process17//mp4/2017/3/17/6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4", "name": "1-课程导学"}, 
{"url": "http://res2.icourses.cn/share/process17//mp4/2017/3/17/a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4", "name": "2-计算机网络的定义与分类"}

I want to extract the file names 6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4 and a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4 from the "url" values.

How would I write a regular expression to match the string at this location?

Based on the strings you provided, you could iterate over the dictionaries, get the value for "url", and apply the following regex:

([^\/]*)$

() - defines capturing group
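[^\/] - a negated character class: matches a single character that is not one of the characters listed after the ^ (here, any character except /)
\/ - matches the character / literally (the backslash is optional in Python, where / has no special meaning)
* - quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)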
$ - asserts position at the end of the string, or before the line terminator right at the end of the string (if any)

For example:

import re

for record in records:  # records: the list of dictionaries shown in the question
    print(re.search(r"([^\/]*)$", record['url']).group(1))

In this case, we are exploiting the fact that the filename occurs at the end of the string. Because of the $ anchor, the only valid match is one that runs to the end of the string.
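To make the role of the anchor concrete, here is a small sketch comparing the same character class with and without $. The variable names are only illustrative; the URL is the first one from the question.

```python
import re

url = ("http://res1.icourses.cn/share/process17//mp4/2017/3/17/"
       "6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4")

# Without the anchor, the search succeeds at the first run of
# non-slash characters, which is the URL scheme.
unanchored = re.search(r"([^/]*)", url).group(1)

# With the anchor, the run of non-slash characters must reach
# the end of the string, so only the last path segment qualifies.
anchored = re.search(r"([^/]*)$", url).group(1)

print(unanchored)  # http:
print(anchored)    # 6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4
```

Note that re.search returns None when nothing matches, so for less predictable input it would be safer to check the result before calling .group().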

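If you wanted to do this on the dictionary cast as a string, you could do so by changing the ending condition, like so: ([^\/]*?)", . Now the sequence ", terminates the match instead of $, and the lazy quantifier *? stops the capture at the first ", it reaches. (If the pattern is written inside a double-quoted string literal, the " has to be escaped as \".) See https://regex101.com/r/k9VwC6/25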
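As a rough sketch of that string-based variant, assuming the data is available as the double-quoted text shown in the question (the name `text` is hypothetical, and Python's own str() of a dict would use single quotes instead):

```python
import re

# The resource from the question, held as one piece of text.
text = (
    '{"url": "http://res1.icourses.cn/share/process17//mp4/2017/3/17/'
    '6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4", "name": "1-课程导学"}, '
    '{"url": "http://res2.icourses.cn/share/process17//mp4/2017/3/17/'
    'a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4", "name": "2-计算机网络的定义与分类"}'
)

# Capture a run of non-slash characters that is immediately followed by ",
# The pattern is in a single-quoted raw string, so " needs no escaping.
filenames_from_text = re.findall(r'([^/]*?)",', text)

print(filenames_from_text)
```

This relies on "url" being the first key in each record. If another value ending in ", came before it, that value would be captured as well, so parsing the data and matching each URL separately is the more robust route.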

Finally, if we weren't so lucky and the filename were not at the end of the string (meaning we couldn't use $), we could use lookaround assertions such as a negative lookbehind. These are described in the regular expression syntax section of the Python re module documentation.
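One possible illustration of the lookaround idea is sketched below. It uses a positive lookbehind and a lookahead rather than a negative lookbehind, since those are simpler to show here. The query string appended to the URL is hypothetical and only serves to move the filename away from the end of the string.

```python
import re

# Hypothetical URL: the question's first URL with a made-up query string.
url_with_query = ("http://res1.icourses.cn/share/process17//mp4/2017/3/17/"
                  "6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4?token=abc")

# (?<=/)     the match must be preceded by a slash
# [^/?]+     one or more characters that are neither / nor ?
# (?=\?|$)   the match must be followed by ? or by the end of the string
pattern = re.compile(r"(?<=/)[^/?]+(?=\?|$)")

match = pattern.search(url_with_query)
filename_before_query = match.group(0) if match else None

print(filename_before_query)  # 6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4
```

Neither assertion consumes characters, so the slash and the question mark stay out of the matched text.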

You can use re.findall:

import re
s = [{"url": "http://res1.icourses.cn/share/process17//mp4/2017/3/17/6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4", "name": "1-课程导学"}, {"url": "http://res2.icourses.cn/share/process17//mp4/2017/3/17/a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4", "name": "2-计算机网络的定义与分类"}]
filenames = [re.findall(r'(?<=/)[\w\-\_]+\.mp4', i['url'])[0] for i in s]

Output:

['6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4', 'a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4']

You can use a shorter regex, [^/]*$:


import re
s = [{"url": "http://res1.icourses.cn/share/process17//mp4/2017/3/17/6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4", "name": "1-课程导学"}, {"url": "http://res2.icourses.cn/share/process17//mp4/2017/3/17/a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4", "name": "2-计算机网络的定义与分类"}]
filenames = [re.findall(r'[^/]*$', i['url'])[0] for i in s]

Output:

['6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4', 'a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4']
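One detail worth knowing about this pattern: because [^/]* can match zero characters, re.findall reports a second, empty match at the very end of the string, which is why the [0] index is needed. The sketch below shows this and a re.search variant that returns only the first match; the variable names are illustrative.

```python
import re

url = ("http://res2.icourses.cn/share/process17//mp4/2017/3/17/"
       "a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4")

# findall returns every non-overlapping match, including the
# empty match that [^/]*$ also finds at the end of the string.
all_matches = re.findall(r"[^/]*$", url)

# search stops at the first (leftmost) match, which is the filename.
first_match = re.search(r"[^/]*$", url).group(0)

print(all_matches)
print(first_match)
```

With re.search there is no trailing empty string to discard, so it may read more clearly when only one filename per URL is expected.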

You can check the regex at https://regex101.com/r/k9VwC6/30

