How to extract file name from URL?

Question

I have file names in a URL and want to strip out the preceding URL and file path, as well as the version string that appears after the ?.

Sample URL

I am trying to use a regex to pull out CaptialForecasting_Datasheet.pdf.

REGEXP_EXTRACT in Google Data Studio seems to behave differently from other regex tools. I tried the suggestion but kept getting a "could not parse" error. I was able to strip out the first part of the URL with the formula below. Event Label is the field where I store the URL of the downloaded PDF.

The URL:

https://www.dudesolutions.com/Portals/0/Documents/HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033

REGEXP_EXTRACT(Event Label, 'Documents/([^&]+)')

The result:

HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033

Now I am trying to work out how to drop everything from the ? onward, where the version data is, so that I extract just Filename.pdf.

Answer 1

You could try:

[^\/]+(?=\?[^\/]*$)

This will match CaptialForecasting_Datasheet.pdf even if there is a question mark in the path. For example, the regex will succeed in both of these cases:

https://www.dudesolutions.com/somepath/CaptialForecasting_Datasheet.pdf?ver
https://www.dudesolutions.com/somepath?/CaptialForecasting_Datasheet.pdf?ver
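
For anyone who wants to check this pattern outside Data Studio, here is a minimal Python sketch using the standard re module. The helper name extract_filename is made up for illustration, and the backslashes are single because the pattern sits in a raw string. Python's re supports lookaheads; engines without lookaround support (RE2, for example) will reject this pattern, so treat this only as a check of the pattern's logic.

```python
import re

# Pattern from the answer: a run of non-slash characters that is followed by
# a "?" with no further "/" between that "?" and the end of the string.
PATTERN = re.compile(r"[^/]+(?=\?[^/]*$)")


def extract_filename(url):
    """Return the file name that precedes the final query string, or None."""
    match = PATTERN.search(url)
    return match.group(0) if match else None


urls = [
    "https://www.dudesolutions.com/somepath/CaptialForecasting_Datasheet.pdf?ver",
    "https://www.dudesolutions.com/somepath?/CaptialForecasting_Datasheet.pdf?ver",
]
for url in urls:
    print(extract_filename(url))
```

A URL with no ? at all returns None here, because the lookahead requires one.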

Answer 2

Assuming that the name appears right after the last / and ends at the ?, the regular expression below leaves the name in group 1, where you can retrieve it with \1 or whatever syntax your tool supports.

.*\/(.*)\?

It basically says: take everything between the last / and the ? that follows it, and put it in group 1. Because (.*) is greedy, the capture runs up to the last ? if the string contains more than one.

Another regular expression, which matches only the file name you want but is more complex, is:
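
A quick way to see the capture group in action is a short Python sketch with the standard re module. The variable names are mine, and the second URL is a made-up example that only serves to show how the greedy (.*) behaves when there are two ? characters.

```python
import re

# ".*/" consumes everything up to the last "/" that still has a "?" after it;
# "(.*)" then captures greedily, so it stops at the LAST "?" in the string.
GREEDY = re.compile(r".*/(.*)\?")
# Variant (an assumption, not part of the answer): stop at the first "?".
FIRST_QMARK = re.compile(r".*/([^?]*)\?")

url = "https://www.dudesolutions.com/Portals/0/Documents/HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033"
two_qmarks = "https://www.dudesolutions.com/somepath/file.pdf?a=1?b=2"  # hypothetical

match = GREEDY.search(url)
filename = match.group(1) if match else None
print(filename)

greedy_result = GREEDY.search(two_qmarks).group(1)
first_result = FIRST_QMARK.search(two_qmarks).group(1)
print(greedy_result)
print(first_result)
```

With a single ? in the URL both patterns give the same result; they only differ when the query string itself contains another ?.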

(?<=\/)[^\/]*(?=\?)

It matches a run of non-/ characters, [^\/]*, immediately preceded by /, (?<=\/), and immediately followed by ?, (?=\?). The first parenthesized expression is a positive lookbehind, and the second is a positive lookahead.
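
The lookaround version can be tried the same way. This is a small Python sketch under the assumption that your engine supports lookbehind and lookahead (Python's re does; RE2-based tools do not). filename_via_lookaround is a hypothetical helper name.

```python
import re

# (?<=/)  : the match must start right after a "/"
# [^/]*   : any run of characters that are not "/"
# (?=\?)  : the match must be followed by a "?"
LOOKAROUND = re.compile(r"(?<=/)[^/]*(?=\?)")


def filename_via_lookaround(url):
    """Return the whole match (no capture group needed), or None."""
    match = LOOKAROUND.search(url)
    return match.group(0) if match else None


url = "https://www.dudesolutions.com/Portals/0/Documents/HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033"
print(filename_via_lookaround(url))
```

Here the file name is the entire match, so there is no group to pull out afterwards.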

Answer 3

This REGEXP_EXTRACT formula captures a run of the characters a-z, A-Z, 0-9, _ and . between a / and a ?:

REGEXP_EXTRACT(Event Label, "/([\\w\\.]+)\\?")

Google Data Studio Report to demonstrate.
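
As a rough check of the same pattern outside Data Studio, here is a Python sketch. In a raw string the backslashes are single, and re.ASCII keeps \w limited to a-zA-Z0-9_ as described above. The helper name and the example.com URL are assumptions for illustration only.

```python
import re

# "/" then a captured run of word characters and dots, then a literal "?".
# No lookarounds are used, only a capture group.
PATTERN = re.compile(r"/([\w.]+)\?", re.ASCII)


def extract_label(url):
    """Return capture group 1 (the file name), or None if nothing matches."""
    match = PATTERN.search(url)
    return match.group(1) if match else None


print(extract_label(
    "https://www.dudesolutions.com/Portals/0/Documents/HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033"
))
# A hyphen is outside the character class, so this hypothetical URL gives None.
print(extract_label("https://example.com/docs/my-file.pdf?ver=1"))
```

If your file names can contain hyphens or other characters, the character class would need to be widened accordingly.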

Answer 4

Please try the following regex:
[A-Za-z_]*\.pdf

I have tested it online at https://regexr.com/.

Please note that this only works for .pdf files whose names contain only letters and underscores.
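
A short Python sketch shows what this pattern does and does not pick up. I am assuming the dot before pdf is meant literally, so it is escaped here; the helper name and the Report2018.pdf URL are made up for the example.

```python
import re

# Letters and underscores, then a literal ".pdf".
PATTERN = re.compile(r"[A-Za-z_]*\.pdf")


def find_pdf_name(url):
    """Return the first match of the pattern in the URL, or None."""
    match = PATTERN.search(url)
    return match.group(0) if match else None


print(find_pdf_name(
    "https://www.dudesolutions.com/Portals/0/Documents/HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033"
))
# Digits are not in the character class, so only the extension is matched here.
print(find_pdf_name("https://example.com/docs/Report2018.pdf?ver=1"))
```

Adding 0-9 (and any other characters you expect) to the class would be needed for file names such as the second one.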

Answer 5

The following regex will extract a file name with a .pdf extension:

(?:[^\/][\d\w\.]+)(?<=(?:\.pdf))

You can add more extensions like this:

(?:[^\/][\d\w\.]+)(?<=(?:\.pdf)|(?:\.jpg))

Demo
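
To try the multi-extension pattern locally, here is a minimal Python sketch. It assumes an engine with lookbehind support (Python's re accepts this one because both alternatives have the same length) and escapes the dots so they match literally. The helper name and the example.com image URL are hypothetical.

```python
import re

# A non-slash character followed by word characters, digits or dots,
# where the match must END in ".pdf" or ".jpg" (checked by the lookbehind).
PATTERN = re.compile(r"(?:[^/][\d\w.]+)(?<=(?:\.pdf)|(?:\.jpg))")


def file_with_extension(url):
    """Return the first .pdf or .jpg file name found in the URL, or None."""
    match = PATTERN.search(url)
    return match.group(0) if match else None


print(file_with_extension(
    "https://www.dudesolutions.com/Portals/0/Documents/HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033"
))
print(file_with_extension("https://example.com/img/photo.jpg?ver=2"))
```

Extensions of a different length (for example .jpeg) may need a separate lookbehind, since some engines only accept fixed-width lookbehinds.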
