I have file names in a URL and want to strip out the preceding URL and file path, as well as the version that appears after the ? character.
I am trying to use RegEx to pull out just the file name, e.g. CaptialForecasting_Datasheet.pdf.
The REGEXP_EXTRACT function in Google Data Studio seems unique. I tried the suggestion but kept getting a "could not parse" error. I was able to strip out the first part of the URL with the following. Event Label is where I store the URL of the downloaded PDF.
The URL:
https://www.dudesolutions.com/Portals/0/Documents/HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033
REGEXP_EXTRACT( Event Label , 'Documents/([^&]+)' )
The result:
HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033
Now I am trying to work out how to drop everything from the ? onward, where the version data is, so as to extract just Filename.pdf.
You could try:
This will match CaptialForecasting_Datasheet.pdf even if there is a question mark in the path. For example, the regex will succeed in both of these cases:
https://www.dudesolutions.com/somepath/CaptialForecasting_Datasheet.pdf?ver
https://www.dudesolutions.com/somepath?/CaptialForecasting_Datasheet.pdf?ver
Assuming that the name appears right after the last / and ends with the ?, the regular expression below will leave the name in group 1, where you can get it with \\1 or whatever the tool that you are using supports.
.*\/(.*)\?
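It basically says: get everything between the last / and the ? that follows it, and put it in group 1. Both .* parts are greedy, so if the string contains more than one ?, group 1 runs up to the last one.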
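As a quick check outside Data Studio, the same pattern can be tried with Python's re module. This is only a sketch: the helper name file_name_from_url is made up for illustration, and Python's regex engine is not the one behind REGEXP_EXTRACT, so treat it as a way to see what group 1 holds rather than as proof that the formula parses there.

```python
import re

# Greedy ".*/" runs up to the last "/" that still has a "?" after it;
# group 1 then holds everything from there up to the final "?".
LAST_SEGMENT = re.compile(r".*/(.*)\?")

def file_name_from_url(url):
    """Return the text between the last '/' and the '?', or None if no match."""
    match = LAST_SEGMENT.match(url)
    return match.group(1) if match else None

url = ("https://www.dudesolutions.com/Portals/0/Documents/"
       "HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033")
print(file_name_from_url(url))  # HC_Brochure_Digital.pdf
```

A URL without any ? gives None here, since the pattern requires one.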
Another regular expression that only matches the file name that you want but is more complex is:
(?<=\/)[^\/]*(?=\?)
It matches all non-/ characters, [^\/]*, immediately preceded by /, (?<=\/), and immediately followed by ?, (?=\?). The first expression in parentheses is a positive lookbehind, and the second is a positive lookahead.
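Here is a small Python sketch of the lookaround version, again with a made-up helper name (name_via_lookaround). Python's re module supports lookbehind and lookahead, but not every engine does: RE2, for example, leaves them out, so a tool built on RE2 syntax would likely reject this pattern.

```python
import re

# The lookbehind requires a "/" just before the match and the lookahead
# requires a "?" just after it; neither character is part of the result.
LOOKAROUND = re.compile(r"(?<=/)[^/]*(?=\?)")

def name_via_lookaround(url):
    """Return the whole match (no capture group is needed), or None."""
    match = LOOKAROUND.search(url)
    return match.group(0) if match else None

print(name_via_lookaround(
    "https://www.dudesolutions.com/somepath/CaptialForecasting_Datasheet.pdf?ver"
))  # CaptialForecasting_Datasheet.pdf
```

Because the whole match is already the file name, this form suits tools that return the full match rather than a capture group.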
This REGEXP_EXTRACT formula captures the characters a-zA-Z0-9_. between / and ?:
REGEXP_EXTRACT(Event Label, "/([\\w\\.]+)\\?")
Google Data Studio Report to demonstrate.
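To see what the capture group picks up, here is a rough Python equivalent. It assumes the doubled backslashes in the formula are only string escaping, so the pattern itself is /([\w.]+)\?. The helper name and the example.com URL are hypothetical.

```python
import re

# re.ASCII keeps \w to a-zA-Z0-9_, matching the character set named above.
WORD_SEGMENT = re.compile(r"/([\w.]+)\?", re.ASCII)

def name_between_slash_and_query(url):
    """Return the run of word characters and dots between a '/' and a '?'."""
    match = WORD_SEGMENT.search(url)
    return match.group(1) if match else None

print(name_between_slash_and_query(
    "https://www.dudesolutions.com/Portals/0/Documents/"
    "HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033"
))  # HC_Brochure_Digital.pdf

# A hyphen is outside [\w.], so a name like my-file.pdf is not matched.
print(name_between_slash_and_query("https://example.com/docs/my-file.pdf?ver=1"))  # None
```

If file names can contain hyphens or other characters, the character class would need to be widened accordingly.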
Please try the following regex: [A-Za-z\\_]*.pdf
I have tried it online at https://regexr.com/ (screenshot attached for reference).
Please note that this only works for .pdf files.
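A quick Python check of this pattern shows one more limit: the character class has no digits, so a file name containing numbers is cut short. The VARIANT pattern, the first_match helper and the example.com URL below are my own additions for illustration, not part of the answer above.

```python
import re

# Pattern as posted: letters, backslash and underscore, then any character, then "pdf".
ORIGINAL = re.compile(r"[A-Za-z\\_]*.pdf")
# Hypothetical variant: also allows digits and hyphens, and escapes the dot.
VARIANT = re.compile(r"[A-Za-z0-9_-]*\.pdf")

def first_match(pattern, url):
    """Return the first match of pattern in url, or None."""
    match = pattern.search(url)
    return match.group(0) if match else None

brochure = ("https://www.dudesolutions.com/Portals/0/Documents/"
            "HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033")
numbered = "https://example.com/docs/Report2018.pdf?ver=1"

print(first_match(ORIGINAL, brochure))  # HC_Brochure_Digital.pdf
print(first_match(ORIGINAL, numbered))  # .pdf
print(first_match(VARIANT, numbered))   # Report2018.pdf
```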
The following regex will extract the file name with the .pdf extension:
(?:[^\/][\d\w\.]+)(?<=(?:.pdf))
You can add more extensions like this:
(?:[^\/][\d\w\.]+)(?<=(?:.pdf)|(?:.jpg))
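The two-extension pattern can be tried in Python as below. It is a sketch only: the helper name and the example.com URLs are hypothetical, and the lookbehind works in Python because both alternatives have the same fixed width, which other engines may not accept.

```python
import re

# Pattern as posted; note the "." inside the lookbehind is unescaped,
# so it accepts any character before "pdf" or "jpg".
BY_EXTENSION = re.compile(r"(?:[^\/][\d\w\.]+)(?<=(?:.pdf)|(?:.jpg))")

def name_with_known_extension(url):
    """Return the first run of name characters that ends in .pdf or .jpg."""
    match = BY_EXTENSION.search(url)
    return match.group(0) if match else None

print(name_with_known_extension(
    "https://www.dudesolutions.com/Portals/0/Documents/"
    "HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033"
))  # HC_Brochure_Digital.pdf
print(name_with_known_extension("https://example.com/images/photo.jpg?ver=1"))  # photo.jpg
print(name_with_known_extension("https://example.com/images/photo.png"))        # None
```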