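Regex for matching alphanumeric within a String containing URLs
How do I match and extract alphanumeric characters (and symbols) within a String that contains URLs, in a few scenarios? I'm currently using Google Apps Script to retrieve the plain body of hyperlinked text from Gmail thread messages, and I basically want to match and extract the title from strings like the following: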
var scenario1 = "Testing: Stack Overflow Title 123? https://www.stackoverflow.com";
...where I only want to output: "Testing: Stack Overflow Title 123?"
Here's another scenario:
var scenario2 = "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com";
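...where, again, I only want to output: "Testing: Stack Overflow Title 123?"
I've tried the following as an initial test to see whether the String contains a URL in the first place (I've confirmed that the URL-matching regex works and outputs https://www.stackoverflow.com), and then to test whether a title exists so I can finally extract it, but to no avail: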
var scenario1 = "Testing: Stack Overflow Title 123? https://www.stackoverflow.com";
var scenario2 = "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com";
var urlRegex = /(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/;
var titleRegex = /^[a-zA-Z0-9_:?']*$/;
var containsUrl = urlRegex.test(element);
if (containsUrl) {
  var containsTitle = titleRegex.test(scenario1);
  if (containsTitle) { // No match, and doesn't run
    var title = titleRegex.exec(element)[0];
    Logger.log("title: " + title);
  }
}
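Essentially, I'd like a regex pattern that matches everything except a URL, if possible.
We can capture any run of text while excluding anything that looks like a URL, using this regex: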
(?:^|\s+)((?:(?!:\/\/).)*)(?=\s|$)
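Explanation:
(?:^|\s+) - matches the start of the line or one or more whitespace characters
((?:(?!:\/\/).)*) - matches any text that does not contain ://, the literal sequence that identifies a URL
(?=\s|$) - a positive lookahead that ensures the match is followed by whitespace or the end of the line
This matches and captures any run of text other than a URL. Hope this works for you.
Here is a JavaScript demo.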
var arr = [
  'Testing1: Stack Overflow Title 123? https://www.stackoverflow.com',
  'https://www.stackoverflow.com Testing2: Stack Overflow Title xyz? https://www.stackoverflow.com Hello this is simple text ftp://www.downloads.com/'
];
for (var s of arr) {
  // A fresh regex per string, so lastIndex starts at 0 each time
  var reg = /(?:^|\s+)((?:(?!:\/\/).)*)(?=\s|$)/g;
  var match = reg.exec(s);
  while (match != null) {
    console.log(match[1]); // the captured non-URL text
    match = reg.exec(s);
  }
}
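Also, I can see you want to restrict which characters are allowed in the matched title. You can use your character set [a-zA-Z0-9_:?' ] (with a space added so that spaces are captured too) instead of . in my regex. The following regex is more precise and avoids capturing titles that contain unintended characters: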
(?:^|\s+)((?:(?!:\/\/)[a-zA-Z0-9_:?' ])*)(?=\s|$)
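The answer gives no demo for this stricter pattern, so here is a minimal sketch of how it might be applied to the two scenarios from the question. The helper name `extractTitles` is hypothetical, and the sketch assumes a runtime that supports `String.prototype.matchAll` (ES2020).

```javascript
// Hypothetical helper, not part of the original answer: collects every
// non-URL run of text that uses only the restricted character set.
const TITLE_RUN = /(?:^|\s+)((?:(?!:\/\/)[a-zA-Z0-9_:?' ])*)(?=\s|$)/g;

function extractTitles(text) {
  // matchAll works on a copy of the regex and steps past zero-length
  // matches, so the iteration cannot stall on an empty capture.
  return Array.from(text.matchAll(TITLE_RUN), m => m[1].trim())
    .filter(title => title.length > 0); // drop empty captures
}

const scenario1 = "Testing: Stack Overflow Title 123? https://www.stackoverflow.com";
const scenario2 = "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com";

console.log(extractTitles(scenario1));
console.log(extractTitles(scenario2));
```

Note that a title containing a character outside the set (a hyphen, for instance) would be cut short or skipped, which is the trade-off of restricting the character class.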
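One possibility is to match up to the first URL you encounter, using either a capturing group or a positive lookahead.
With a positive lookahead, that could look like this: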
\bTesting: .*?(?=\s*(?:https?|ftps?):\/\/)
const regexLookahead = /\bTesting: .*?(?=\s*(?:https?|ftps?):\/\/)/;
[
  "Testing: Stack Overflow Title 123? https://www.stackoverflow.com",
  "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com"
].forEach(s => console.log(s.match(regexLookahead)[0]));
With a capturing group, your value will be in the first capturing group:
(\bTesting: .*?)\s*(?:https?|ftps?):\/\/
const regexGroup = /(\bTesting: .*?)\s*(?:https?|ftps?):\/\//;
[
  "Testing: Stack Overflow Title 123? https://www.stackoverflow.com",
  "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com"
].forEach(s => console.log(s.match(regexGroup)[1]));
If you want to keep everything except the URLs, you can match the URLs and replace them with an empty string:
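Both patterns above hardcode the "Testing: " prefix, so they only work when the title is known to start that way. As a hedged sketch of one way to drop that assumption, the pattern below skips any leading URLs and then lazily captures text up to the next URL or the end of the string. The helper name `firstTitle` is hypothetical, and the protocol list (http, https, ftp, ftps) is carried over from the regexes above.

```javascript
// Hypothetical helper, not part of the original answer: returns the first
// run of non-URL text without relying on a fixed "Testing: " prefix.
const FIRST_TITLE =
  /^(?:\s*(?:https?|ftps?):\/\/\S+)*\s*(.*?)\s*(?=(?:https?|ftps?):\/\/|$)/;

function firstTitle(text) {
  const m = text.match(FIRST_TITLE);
  // Group 1 is empty when the string holds nothing but URLs.
  return m ? m[1] : "";
}

console.log(firstTitle("Testing: Stack Overflow Title 123? https://www.stackoverflow.com"));
console.log(firstTitle("https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com"));
```

Only the first title is returned; any text after the second URL is ignored.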
\s*(?:https?|ftps?):\/\/\S+
[
  "Testing: Stack Overflow Title 123? https://www.stackoverflow.com",
  "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com",
  "https://www.stackoverflow.com test https://www.stackoverflow.com test https://www.stackoverflow.com test",
  "https://www.stackoverflow.com test",
  "test https://www.stackoverflow.com"
].forEach(s => console.log(s.replace(/\s*(?:https?|ftps?):\/\/\S+/g, '').trim()));
You can use .split() on the space character and .filter() on the resulting array to exclude elements that either start with a protocol or end with a word, then a dot character, then another word at the end of the string.
const splitURL = s => s.split(' ')
  // Drop tokens that start with "protocol://" or end in "word.word"
  .filter(w => !/^\w+(?=:\/\/)|\w+\.\w+$/.test(w))
  .join(' ');
var scenario1 = "Testing: Stack Overflow Title 123? https://www.stackoverflow.com";
var scenario2 = "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com";
console.log(splitURL(scenario1), splitURL(scenario2));
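One caveat with the filter above: its second alternative removes any token ending in word.word, so an ordinary token such as notes.txt would be dropped along with the URLs. A more conservative sketch, assuming only http, https, ftp and ftps count as URL protocols, is shown below. The helper name `stripUrlTokens` is hypothetical.

```javascript
// Hypothetical helper, not part of the original answer: drops only the
// whitespace-separated tokens that begin with a known protocol.
const PROTOCOL_PREFIX = /^(?:https?|ftps?):\/\//i;

function stripUrlTokens(text) {
  return text
    .split(/\s+/) // split on any run of whitespace
    .filter(token => token !== "" && !PROTOCOL_PREFIX.test(token))
    .join(" ");
}

console.log(stripUrlTokens("Testing: Stack Overflow Title 123? https://www.stackoverflow.com"));
console.log(stripUrlTokens("Read notes.txt at https://www.stackoverflow.com"));
```

The trade-off is that bare domains without a protocol (www.stackoverflow.com, for example) are kept, since nothing marks them as URLs under this assumption.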