I have a resource formatted like this:
{"url": "http://res1.icourses.cn/share/process17//mp4/2017/3/17/6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4", "name": "1-课程导学"},
{"url": "http://res2.icourses.cn/share/process17//mp4/2017/3/17/a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4", "name": "2-计算机网络的定义与分类"}
I want to extract the file names 6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4 and a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4 from the URLs.
How would I write a regular expression to match the string at this location?
Based on the strings you provided, you could iterate over the dictionaries, get the value for "url", and apply the following regex:
([^\/]*)$
Explanation:
() - defines a capturing group
[^\/] - matches any single character other than the one listed after the ^ (here, /)
\/ - matches the character / literally (the backslash is optional in Python)
* - quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ - asserts position at the end of the string, or just before a line terminator at the very end of the string (if any)
For example:
import re
for record in records:  # records: the list of dictionaries from the question
    print(re.search(r"([^\/]*)$", record['url']).group(1))
In this case, we are exploiting the fact that the filename occurs at the end of the string. Using the $ anchor means the only valid match is one that runs to the end of the string.
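To see what the $ anchor does at the edges, here is a minimal sketch. The helper name filename_from_url and the second URL (ending in a slash) are made up for illustration; they are not part of the question. Because * allows zero characters, a URL ending in / still matches, but with an empty result.

```python
import re

# The first URL comes from the question; the second (trailing slash) is a
# hypothetical edge case added for illustration.
urls = [
    "http://res1.icourses.cn/share/process17//mp4/2017/3/17/6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4",
    "http://res1.icourses.cn/share/process17/",
]

# Everything after the last '/', anchored to the end of the string.
last_segment = re.compile(r"([^/]*)$")

def filename_from_url(url):
    """Return whatever follows the final '/' (may be an empty string)."""
    return last_segment.search(url).group(1)

results = [filename_from_url(u) for u in urls]
print(results)
```

If empty results are not acceptable, changing * to + makes the pattern require at least one character, in which case re.search can return None and the caller has to check for that.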
If you wanted to run this against a dictionary serialized as a string (as in the JSON-style text in the question), you could change the ending condition, like so: ([^\/]*?)\", . Now ", terminates the match (note the \ that escapes the "). See https://regex101.com/r/k9VwC6/25
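As a rough sketch of that variant in Python, assuming the records are available as one piece of JSON-style text (the variable name text is an assumption), the pattern can be handed to re.findall. In a raw single-quoted string the " needs no escaping.

```python
import re

# Assumption: the records from the question held as a single text blob.
text = (
    '{"url": "http://res1.icourses.cn/share/process17//mp4/2017/3/17/'
    '6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4", "name": "1-课程导学"},\n'
    '{"url": "http://res2.icourses.cn/share/process17//mp4/2017/3/17/'
    'a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4", "name": "2-计算机网络的定义与分类"}'
)

# Lazily capture the run of non-slash characters that ends at the next '",'.
pattern = re.compile(r'([^/]*?)",')

filenames = pattern.findall(text)
print(filenames)
```

This relies on the url value being followed directly by ", and on no other field ending the same way after a slash-free run, so it is more fragile than parsing the text and reading the "url" key.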
Finally, if we weren't so lucky that the capturing group was at the end of the string (meaning we couldn't use $), we could use a lookaround assertion such as a lookbehind. You can read up on those in the documentation for Python's re module.
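One way this might look, as a sketch only: the URL below with a ?token=abc query string is a hypothetical example, not from the question. A lookbehind requires a / just before the match, and a lookahead requires ?, # or the end of the string just after it, so the filename is found even though it no longer ends the string.

```python
import re

# Hypothetical URL: the filename is followed by a query string.
url = (
    "http://res1.icourses.cn/share/process17//mp4/2017/3/17/"
    "6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4?token=abc"
)

# (?<=/)      the match must be preceded by '/'
# [^/?#]+     one or more characters that are not '/', '?' or '#'
# (?=[?#]|$)  the match must be followed by '?', '#' or the end of the string
pattern = re.compile(r"(?<=/)[^/?#]+(?=[?#]|$)")

match = pattern.search(url)
filename = match.group(0) if match else None
print(filename)
```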
You can use re.findall:
import re
s = [{"url": "http://res1.icourses.cn/share/process17//mp4/2017/3/17/6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4", "name": "1-课程导学"}, {"url": "http://res2.icourses.cn/share/process17//mp4/2017/3/17/a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4", "name": "2-计算机网络的定义与分类"}]
# Raw string avoids invalid-escape warnings; \w already covers the underscore.
filenames = [re.findall(r'(?<=/)[\w-]+\.mp4', i['url'])[0] for i in s]
print(filenames)
Output:
['6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4', 'a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4']
You can use a short regex [^/]*$
Code:
import re
s = [{"url": "http://res1.icourses.cn/share/process17//mp4/2017/3/17/6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4", "name": "1-课程导学"}, {"url": "http://res2.icourses.cn/share/process17//mp4/2017/3/17/a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4", "name": "2-计算机网络的定义与分类"}]
filenames = [re.findall('[^/]*$', i['url'])[0] for i in s]
print(filenames)
Output:
['6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4', 'a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4']
Check the regex - https://regex101.com/r/k9VwC6/30
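A note on the [0] index: because [^/]* can match zero characters, re.findall with this pattern also reports an empty match at the very end of the string, after the filename. The sketch below shows that, along with re.search as an alternative that returns only the first match. The variable names are chosen for illustration.

```python
import re

url = (
    "http://res1.icourses.cn/share/process17//mp4/2017/3/17/"
    "6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4"
)

# findall returns the filename plus a trailing empty match,
# which is why the answer takes element [0].
matches = re.findall(r'[^/]*$', url)
print(matches)

# search stops at the first match, so no indexing into a list is needed.
filename = re.search(r'[^/]*$', url).group(0)
print(filename)
```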