
How to extract file name from URL?

I have file names in a URL and want to strip out the preceding URL and filepath as well as the version that appears after the ?

Sample URL

I am trying to use a regex to pull out CaptialForecasting_Datasheet.pdf

REGEXP_EXTRACT in Google Data Studio seems to behave differently from other regex tools. I tried the suggestion but kept getting a "could not parse" error. I was able to strip out the first part of the URL with the formula below. Event Label is the field where I store the URL of the downloaded PDF.

The URL:

https://www.dudesolutions.com/Portals/0/Documents/HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033

REGEXP_EXTRACT( Event Label , 'Documents/([^&]+)' )

The result:

HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033

Now I am trying to work out how to drop everything from the ? onward, where the version data is, so as to extract just Filename.pdf .

You could try:

[^\/]+(?=\?[^\/]*$)

This will match CaptialForecasting_Datasheet.pdf even if there is a question mark in the path. For example, the regex will succeed in both of these cases:

https://www.dudesolutions.com/somepath/CaptialForecasting_Datasheet.pdf?ver
https://www.dudesolutions.com/somepath?/CaptialForecasting_Datasheet.pdf?ver
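To check the two cases above, here is a minimal sketch in Python, assuming the pattern is written with single backslashes in a raw string. Python's re module supports lookaheads; whether the target tool does is not verified here. The name extract_filename is a hypothetical helper for illustration only.

```python
import re

# Pattern from the answer above, written as a Python raw string.
FILENAME_BEFORE_QUERY = re.compile(r"[^\/]+(?=\?[^\/]*$)")

urls = [
    "https://www.dudesolutions.com/somepath/CaptialForecasting_Datasheet.pdf?ver",
    "https://www.dudesolutions.com/somepath?/CaptialForecasting_Datasheet.pdf?ver",
]


def extract_filename(url):
    """Hypothetical helper: return the last path segment before the query string, or None."""
    match = FILENAME_BEFORE_QUERY.search(url)
    return match.group(0) if match else None


for url in urls:
    print(extract_filename(url))
```

The lookahead requires that no / appears between the ? and the end of the string, which is why the ? inside the path in the second URL is skipped.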

Assuming that the name appears right after the last / and ends at the ? , the regular expression below leaves the name in group 1, where you can get it with \1 or whatever syntax the tool you are using supports.

.*\/(.*)\?

It basically says: because both .* are greedy, take everything between the last / that still has a ? after it and the last ? , and put it in group 1. For a URL with a single ? , that is the file name.
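As a quick check of the capture-group approach, here is a small Python sketch that runs the expression against the URL from the question. Reading the group with match.group(1) is Python's way of accessing group 1; other tools use their own syntax.

```python
import re

url = (
    "https://www.dudesolutions.com/Portals/0/Documents/"
    "HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033"
)

# Group 1 holds the text between the last "/" and the "?".
match = re.search(r".*\/(.*)\?", url)
filename = match.group(1) if match else None
print(filename)
```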

Another regular expression that only matches the file name that you want but is more complex is:

(?<=\/)[^\/]*(?=\?)

It matches a run of non- / characters, [^\/]* , immediately preceded by / , (?<=\/) , and immediately followed by ? , (?=\?) . The first parenthesized expression is a positive lookbehind, and the last one is a positive lookahead.
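The lookaround version can be tried the same way. This is a sketch in Python, whose re module supports both lookbehind and lookahead; support in other engines is not assumed. The variable names are illustrative only.

```python
import re

# Lookbehind for "/", a run of non-"/" characters, lookahead for "?".
LOOKAROUND = re.compile(r"(?<=\/)[^\/]*(?=\?)")

url = (
    "https://www.dudesolutions.com/Portals/0/Documents/"
    "HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033"
)

match = LOOKAROUND.search(url)
# The whole match is the file name; no capture group is needed.
whole_match = match.group(0) if match else None
print(whole_match)
```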

This REGEXP_EXTRACT formula captures a run of the characters a-zA-Z0-9_. between / and ? :

REGEXP_EXTRACT(Event Label, "/([\\w\\.]+)\\?")

Google Data Studio Report to demonstrate.
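For readers without access to Data Studio, the same pattern can be exercised in Python. This is only an approximation: regexp_extract below is a hypothetical stand-in that returns the first capture group, and a Python raw string needs single backslashes where the quoted formula above doubles them. The example.com URL is made up for illustration.

```python
import re


def regexp_extract(text, pattern):
    """Hypothetical stand-in for REGEXP_EXTRACT: first capture group, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Same pattern as the formula above, with single backslashes in a raw string.
PATTERN = r"/([\w\.]+)\?"

event_label = (
    "https://www.dudesolutions.com/Portals/0/Documents/"
    "HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033"
)
print(regexp_extract(event_label, PATTERN))

# Characters outside a-zA-Z0-9_. (a hyphen here) prevent a match.
print(regexp_extract("https://example.com/docs/my-file.pdf?ver=1", PATTERN))
```

Because the character class is limited to word characters and dots, a file name containing a hyphen or a space is not captured by this pattern.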


Please try the following regex:
[A-Za-z_]*\.pdf

I have tried it online at https://regexr.com/ .

Please note that this only works for .pdf files.
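A short Python sketch of this suggestion, assuming the dot before pdf is escaped so that it matches a literal dot. The helper find_pdf_name and the example.com URL are hypothetical. The second call shows a further limit: since the class holds only letters and underscores, a digit in the name cuts the match short.

```python
import re

# Letters and underscores, then a literal ".pdf".
LETTERS_ONLY_PDF = re.compile(r"[A-Za-z_]*\.pdf")


def find_pdf_name(url):
    """Hypothetical helper: return the first match of the pattern, or None."""
    match = LETTERS_ONLY_PDF.search(url)
    return match.group(0) if match else None


print(find_pdf_name(
    "https://www.dudesolutions.com/Portals/0/Documents/"
    "HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033"
))

# The digits are not in the character class, so only ".pdf" is matched.
print(find_pdf_name("https://example.com/files/Report2018.pdf?ver=1"))
```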

The following regex extracts a file name with the .pdf extension:

(?:[^\/][\d\w\.]+)(?<=(?:\.pdf))

You can add more extensions like this:

(?:[^\/][\d\w\.]+)(?<=(?:\.pdf)|(?:\.jpg))
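The multi-extension pattern can be sketched in Python as follows, assuming the dots in the extensions are escaped. Python's re only accepts a lookbehind whose alternatives all have the same width, which holds here because .pdf and .jpg are both four characters. The helper name and the example.com URLs are hypothetical.

```python
import re

# A run of word characters and dots that must end in ".pdf" or ".jpg".
BY_EXTENSION = re.compile(r"(?:[^\/][\d\w\.]+)(?<=(?:\.pdf)|(?:\.jpg))")


def find_by_extension(url):
    """Hypothetical helper: return the first file name with a listed extension, or None."""
    match = BY_EXTENSION.search(url)
    return match.group(0) if match else None


print(find_by_extension(
    "https://www.dudesolutions.com/Portals/0/Documents/"
    "HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033"
))
print(find_by_extension("https://example.com/img/photo.jpg?v=2"))
```

An extension of a different length, such as .jpeg, would need its own lookbehind in Python rather than another alternative inside this one.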


