Using a regular expression to extract a file name from a URL: some characters need to be excluded

Question

I have a resource formatted like the following:

{"url": "http://res1.icourses.cn/share/process17//mp4/2017/3/17/6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4", "name": "1-课程导学"}, 
{"url": "http://res2.icourses.cn/share/process17//mp4/2017/3/17/a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4", "name": "2-计算机网络的定义与分类"}

I want to extract the file names 6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4 and a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4 from the url values.

How would I write a regular expression that matches the string at this location?

Answer 1

Based on the strings you provided, you could iterate over the dictionaries, get the value for "url", and use the following regex:

([^\/]*)$

Explanation:

() - defines a capturing group
[^\/] - matches any single character other than / (the ^ negates the set)
\/ - matches the character / literally; the backslash is optional in Python, where / has no special meaning in a pattern
* - quantifier: matches zero or more times, as many times as possible, giving back as needed (greedy)
$ - asserts position at the end of the string, or before the line terminator right at the end of the string (if any)

For example:

import re
# records is the list of dictionaries shown in the question
for record in records:
    print(re.search(r"([^/]*)$", record['url']).group(1))

In this case, we are exploiting the fact that the file name occurs at the end of the string. Because of the $ anchor, the only valid match is one that runs to the end of the string.
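
As a quick check of the end anchor, the sketch below runs the pattern on one URL from the question and on a hypothetical URL that ends in a slash (the second URL and the variable names are assumptions for illustration, not part of the question). Since * also allows zero characters, the second case yields an empty string rather than a failed match.

```python
import re

LAST_SEGMENT = re.compile(r"([^/]*)$")  # everything after the final "/"

url = "http://res1.icourses.cn/share/process17//mp4/2017/3/17/6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4"
trailing_slash_url = "http://res1.icourses.cn/share/"  # hypothetical, for illustration

file_name = LAST_SEGMENT.search(url).group(1)
empty_name = LAST_SEGMENT.search(trailing_slash_url).group(1)

print(file_name)         # 6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4
print(repr(empty_name))  # ''
```

If an empty result should count as "no file name", the caller has to test for it explicitly, because the search itself still succeeds.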

If you wanted to apply this to the dictionary serialized as a string, you could do so by changing the ending condition, like so: ([^\/]*?)\", Now ", terminates the match (the \ escapes the ", which is needed when the pattern sits inside a double-quoted string literal). See https://regex101.com/r/k9VwC6/25
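
A minimal sketch of the serialized case, assuming the records are turned into text with json.dumps (an assumption; the answer does not say how the string is produced). Note that Python's str() on a dictionary uses single quotes, so the ", terminator would not match there.

```python
import json
import re

records = [
    {"url": "http://res1.icourses.cn/share/process17//mp4/2017/3/17/6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4", "name": "1-课程导学"},
    {"url": "http://res2.icourses.cn/share/process17//mp4/2017/3/17/a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4", "name": "2-计算机网络的定义与分类"},
]

# Serialize to JSON text: each url value is followed by '", ' in the output.
text = json.dumps(records)

# Lazily match characters other than "/" up to the closing quote and comma.
# A raw, single-quoted string means the double quote needs no escaping.
filenames = re.findall(r'([^/]*?)",', text)
print(filenames)
```

This relies on the url value being followed by another key. If "url" were the last key in each dictionary, or another string value were followed by a comma, the pattern would need adjusting.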

Finally, if we weren't lucky enough to have the file name at the end of the string (meaning we couldn't use $), we could use a lookaround such as a negative lookbehind.
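
One way such a case might look, as a sketch only: the URL below has a hypothetical query string appended, so the file name no longer ends the string. The pattern uses a positive lookbehind and a lookahead rather than a negative lookbehind; it is one possible approach under that assumption, not necessarily the one this answer had in mind.

```python
import re

# Hypothetical URL: the "?token=abc" suffix is not part of the question's data.
url = (
    "http://res1.icourses.cn/share/process17//mp4/2017/3/17/"
    "6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4?token=abc"
)

# (?<=/)      the match must start right after a "/"
# [^/?#]+     one or more characters that are not "/", "?" or "#"
# (?=[?#]|$)  the match must be followed by "?", "#" or the end of the string
LAST_SEGMENT = re.compile(r"(?<=/)[^/?#]+(?=[?#]|$)")

match = LAST_SEGMENT.search(url)
file_name = match.group(0) if match else None
print(file_name)
```

Neither lookaround consumes characters, so the surrounding / and ? stay out of the result.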

Answer 2

You can use re.findall:

import re
s = [{"url": "http://res1.icourses.cn/share/process17//mp4/2017/3/17/6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4", "name": "1-课程导学"}, {"url": "http://res2.icourses.cn/share/process17//mp4/2017/3/17/a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4", "name": "2-计算机网络的定义与分类"}]
filenames = [re.findall(r'(?<=/)[\w-]+\.mp4', i['url'])[0] for i in s]

Output:

['6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4', 'a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4']
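
Unlike the end-anchored patterns, this one only matches names ending in .mp4, so re.findall returns an empty list for any other URL and the [0] index raises IndexError. A minimal sketch of a guarded variant follows; the helper name mp4_name and the .txt URL are hypothetical, introduced only for illustration.

```python
import re

# Same idea as the pattern above: word characters or hyphens, then ".mp4",
# starting right after a "/".
MP4_NAME = re.compile(r"(?<=/)[\w-]+\.mp4")

def mp4_name(url):
    """Return the first .mp4 file name found in url, or None if there is none."""
    match = MP4_NAME.search(url)
    return match.group(0) if match else None

print(mp4_name("http://res1.icourses.cn/share/process17//mp4/2017/3/17/6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4"))
print(mp4_name("http://res1.icourses.cn/share/readme.txt"))  # hypothetical URL -> None
```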

Answer 3

You can use a short regex: [^/]*$

Code:

import re
s = [{"url": "http://res1.icourses.cn/share/process17//mp4/2017/3/17/6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4", "name": "1-课程导学"}, {"url": "http://res2.icourses.cn/share/process17//mp4/2017/3/17/a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4", "name": "2-计算机网络的定义与分类"}]
filenames = [re.findall(r'[^/]*$', i['url'])[0] for i in s]
print(filenames)

Output:

['6332c641-28b5-43a0-894c-972bd804f4e1_SD.mp4', 'a21902b6-8680-4bdf-8f47-4f99d1354475_SD.mp4']

Check the regex - https://regex101.com/r/k9VwC6/30

3 answers

Answer 1
0 2018-02-22 13:00:37

Answer 2
0 Accepted 2018-02-22 15:14:54

Answer 3
0 2018-03-05 13:43:44
