
(Java) RegEx to get the URLs from CSS?

I'm parsing CSS to get the URLs out of linked style sheets. This is a Java app. (I tried using the CSSParser (http://cssparser.sourceforge.net/); however, it silently drops many of the rules when it parses.)

So I'm just using regex. I'd like a regex that gets me just the URLs and is robust enough to deal with real CSS from the wild:

background-image: url('test/test.gif');
background: url("test2/test2.gif");
background-image: url(test3/test3.gif);
background: url   ( test4/ test4.gif );
background: url( " test5/test5.gif"   );

You get the idea. This is in Java's regex implementation (not my favorite).

The problem with regexes is that they are sometimes stricter than you need. If you had shown us your current, not-quite-working regex, I would have been able to help you more.

First comment: browsers tend to tolerate the majority of HTML/CSS mistakes (NOT JavaScript, which is a programming language and not a markup language).

You could start with the background(-image)? token to lock the first part. How to proceed? Very difficult...

You always have a colon, so you can add it to the constant part of the token, and then, judging from your example (not from the CSS specs), a variable number of whitespaces followed by the url token. A variable number of whitespaces is [\s]* (written [\\s]* in a Java string literal), and this becomes part of our regex.

I tried this with RegexBuddy:

background(-image)?: url[\s]*\([\s]*(?<url>[^\)]*)\);

Unfortunately, it captures whitespace inside URLs:

Matched text: background-image: url('test/test.gif');
Match offset: 0
Match length: 39
Backreference 1: -image
Backreference 1 offset: 10
Backreference 1 length: 6
Backreference 2: 'test/test.gif'
Backreference 2 offset: 22
Backreference 2 length: 15

Matched text: background: url   ( test4/ test4.gif );
Match offset: 119
Match length: 39
Backreference 1: 
Backreference 1 offset: -1
Backreference 1 length: 0
Backreference 2:  test4/ test4.gif 
Backreference 2 offset: 138
Backreference 2 length: 18

So, when you get the URL this way, you must trim the string. I couldn't exclude whitespace from the url group because of example 4, which has to match as a URL with a whitespace in it. That example is arguably not even correct unless you really have a file named %20test4.gif.
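As a rough illustration of the trimming step, here is a minimal Java sketch that applies the first regex above and then cleans up each captured value. The class name `BackgroundUrlExtractor` and the quote-stripping logic are assumptions added for the example; the answer itself only says the string must be trimmed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackgroundUrlExtractor {
    // The regex from the answer, with backslashes doubled for a Java string literal.
    private static final Pattern BACKGROUND_URL = Pattern.compile(
            "background(-image)?: url[\\s]*\\([\\s]*(?<url>[^\\)]*)\\);");

    public static List<String> extract(String css) {
        List<String> urls = new ArrayList<>();
        Matcher m = BACKGROUND_URL.matcher(css);
        while (m.find()) {
            // The named group may still carry surrounding whitespace and quotes.
            String raw = m.group("url").trim();
            if (raw.length() >= 2) {
                char first = raw.charAt(0);
                char last = raw.charAt(raw.length() - 1);
                // Assumption: strip one pair of matching quotes, then trim again.
                if ((first == '\'' || first == '"') && first == last) {
                    raw = raw.substring(1, raw.length() - 1).trim();
                }
            }
            urls.add(raw);
        }
        return urls;
    }

    public static void main(String[] args) {
        String css = String.join("\n",
                "background-image: url('test/test.gif');",
                "background: url(\"test2/test2.gif\");",
                "background-image: url(test3/test3.gif);",
                "background: url   ( test4/ test4.gif );",
                "background: url( \" test5/test5.gif\"   );");
        extract(css).forEach(System.out::println);
    }
}
```

Whitespace inside the URL of the fourth sample is kept on purpose, since only the ends of the captured string are trimmed.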

[Edit] I prefer the following version of the regex:

background(-image)?: url[\s]*\([\s]*(?<url>[^\)]*)[\s]*\)[\s]*;

It tolerates more whitespace.
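To make the difference concrete, the sketch below compares the two patterns on a declaration with a space between the closing parenthesis and the semicolon. The class and method names are made up for the example. Note that the captured group still needs trimming in both cases, because [^\)]* also matches spaces.

```java
import java.util.regex.Pattern;

public class WhitespaceToleranceDemo {
    // First version from the answer.
    static final Pattern FIRST = Pattern.compile(
            "background(-image)?: url[\\s]*\\([\\s]*(?<url>[^\\)]*)\\);");
    // Revised version from the [Edit].
    static final Pattern REVISED = Pattern.compile(
            "background(-image)?: url[\\s]*\\([\\s]*(?<url>[^\\)]*)[\\s]*\\)[\\s]*;");

    // True if the pattern is found anywhere in the declaration.
    public static boolean matches(Pattern pattern, String declaration) {
        return pattern.matcher(declaration).find();
    }

    public static void main(String[] args) {
        // Hypothetical input with a space before the semicolon.
        String spaced = "background: url( test4/ test4.gif ) ;";
        System.out.println(matches(FIRST, spaced));   // false
        System.out.println(matches(REVISED, spaced)); // true
    }
}
```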

Can you use ONLY regexes? Your life could be made so much easier if you used string functions to remove all the spaces first; then you can write a regex that doesn't have to worry about whitespace.

Here's a quick one; it might not work very well:

background(-image)?:url\\(["']?(.*)["']?\\);

The second capture group should give you what you want.

The .* should probably be replaced with a character class that contains all the characters a valid path can contain.
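A possible Java sketch of this approach follows: strip all whitespace with a string function, then match with a simplified pattern. The character class used here (anything except quotes and a closing parenthesis) is an assumption standing in for "all the characters a valid path can contain", and the class name is hypothetical. One caveat of this approach is that whitespace inside a URL is removed as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StrippedCssUrlExtractor {
    // Same shape as the quick regex above, but with ".*" replaced by an
    // assumed character class that stops at quotes and ")".
    private static final Pattern URL = Pattern.compile(
            "background(-image)?:url\\([\"']?([^\"')]*)[\"']?\\);");

    public static List<String> extract(String css) {
        // Remove every whitespace character before matching.
        String stripped = css.replaceAll("\\s+", "");
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(stripped);
        while (m.find()) {
            urls.add(m.group(2)); // second capture group holds the URL
        }
        return urls;
    }

    public static void main(String[] args) {
        String css = String.join("\n",
                "background-image: url('test/test.gif');",
                "background: url   ( test4/ test4.gif );",
                "background: url( \" test5/test5.gif\"   );");
        extract(css).forEach(System.out::println);
    }
}
```

With this input the second URL comes out as test4/test4.gif, so the approach only fits if spaces inside URLs can be discarded.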

Regexes are really hard to maintain. I suggest you look at SAC:

http://www.w3.org/Style/CSS/SAC/Overview.en.html
