
Swift - Regex to extract value

I want to extract a value from a string that has unique start and end markers. In my case the marker is the em tag:

"Fully <em>Furni<\/em>shed |Downtown and Canal Views",

Result

Furnished

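I guess you want to remove the tags.

If the backslash is only an artifact of escaping, the pattern is very simple: basically <em> with an optional slash /?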

let trimmedString = string.replacingOccurrences(of: "</?em>", with: "", options: .regularExpression)

To take the backslash into account as well:

let trimmedString = string.replacingOccurrences(of: "<\\\\?/?em>", with: "", options: .regularExpression)
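To see the second pattern handle both forms of the closing tag, here is a minimal sketch. The two sample strings are assumptions for illustration: one with plain tags and one with the escaped closing tag from the question.

```swift
import Foundation

// Sample inputs (assumed for illustration): plain tags, and a closing tag
// that carries a literal backslash before the slash.
let plain = "Fully <em>Furni</em>shed"
let escaped = "Fully <em>Furni<\\/em>shed"

// Regex: "<", an optional backslash, an optional slash, then "em>".
let tagPattern = "<\\\\?/?em>"

let strippedPlain = plain.replacingOccurrences(of: tagPattern, with: "", options: .regularExpression)
let strippedEscaped = escaped.replacingOccurrences(of: tagPattern, with: "", options: .regularExpression)

print(strippedPlain)
print(strippedEscaped)
```

Both calls should yield the same tag-free string, since the backslash is optional in the pattern.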

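If you want to extract only Furnished, you need capture groups: one for the string between the tags, and one for everything after the closing tag up to the next whitespace character.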

let string = "Fully <em>Furni<\\/em>shed |Downtown and Canal Views"
let pattern = "<em>(.*)<\\\\?/em>(\\S+)"
do {
    let regex = try NSRegularExpression(pattern: pattern)
    if let match = regex.firstMatch(in: string, range: NSRange(string.startIndex..., in: string)) {
        let part1 = string[Range(match.range(at: 1), in: string)!]
        let part2 = string[Range(match.range(at: 2), in: string)!]
        print(String(part1 + part2))
    }
} catch { print(error) }
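If the input may contain several tagged words, the greedy (.*) would run from the first opening tag to the last closing tag. A possible variation, sketched here with a lazy quantifier and a hypothetical helper named emphasizedWords(in:) (not part of the answer above), collects every such word:

```swift
import Foundation

// Hypothetical helper: returns each word that contains an <em>…</em> span,
// with the tags removed.
func emphasizedWords(in text: String) -> [String] {
    // Lazy (.*?) stops at the first closing tag; (\S*) takes the rest of the word.
    let pattern = "<em>(.*?)<\\\\?/em>(\\S*)"
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, range: fullRange).compactMap { match -> String? in
        guard let inner = Range(match.range(at: 1), in: text),
              let tail = Range(match.range(at: 2), in: text) else { return nil }
        return String(text[inner]) + String(text[tail])
    }
}

let words = emphasizedWords(in: "<em>Furni<\\/em>shed <em>balc<\\/em>ony")
print(words)
```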

Regex:

If you want to do this with a regular expression, you can use Valexa's answer:

public extension String {
    public func capturedGroups(withRegex pattern: String) -> [String] {
        var results = [String]()

        var regex: NSRegularExpression
        do {
            regex = try NSRegularExpression(pattern: pattern, options: [])
        } catch {
            return results
        }
        // NSRange counts UTF-16 code units, so build it from the string rather than from `count`
        let matches = regex.matches(in: self, options: [], range: NSRange(startIndex..., in: self))

        guard let match = matches.first else { return results }

        let lastRangeIndex = match.numberOfRanges - 1
        guard lastRangeIndex >= 1 else { return results }

        for i in 1...lastRangeIndex {
            let capturedGroupIndex = match.range(at: i)
            let matchedString = (self as NSString).substring(with: capturedGroupIndex)
            results.append(matchedString)
        }

        return results
    }
}

Like this:

let text = "Fully <em>Furni</em>shed |Downtown and Canal Views"
print(text.capturedGroups(withRegex: "<em>([a-zA-Z]+)</em>"))

Result:

["Furni"]
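One thing to watch when building the NSRange for a search like this: NSRange counts UTF-16 code units, while String.count counts Characters, and the two differ as soon as the text contains characters such as emoji. A small sketch, with a sample string that is only an assumption for illustration:

```swift
import Foundation

// Sample text (assumed): the emoji occupies two UTF-16 code units but is one Character.
let sample = "🏠 <em>Furni</em>shed"

print(sample.count)        // number of Characters
print(sample.utf16.count)  // number of UTF-16 code units, which NSRange expects

let regex = try! NSRegularExpression(pattern: "<em>([a-zA-Z]+)</em>")
// Building the range from the string itself keeps it in UTF-16 units.
let fullRange = NSRange(sample.startIndex..., in: sample)

var captured: String? = nil
if let match = regex.firstMatch(in: sample, range: fullRange),
   let range = Range(match.range(at: 1), in: sample) {
    captured = String(sample[range])
}
print(captured ?? "no match")
```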

NSAttributedString:

If you want to highlight the text, only need to strip the tags, or cannot use the first solution for any other reason, you can also use NSAttributedString:

extension String {
    var attributedStringAsHTML: NSAttributedString? {
        do{
            return try NSAttributedString(data: Data(utf8),
                                          options: [
                                            .documentType: NSAttributedString.DocumentType.html,
                                            .characterEncoding: String.Encoding.utf8.rawValue],
                                          documentAttributes: nil)
        }
        catch {
            print("error: ", error)
            return nil
        }
    }

}

func getTextSections(_ text:String) -> [String] {
    guard let attributedText = text.attributedStringAsHTML else {
        return []
    }
    var sections:[String] = []
    let range = NSMakeRange(0, attributedText.length)

    // we don't need to enumerate any special attribute here,
    // but for example, if you want to just extract links you can use `NSAttributedString.Key.link` instead
    let attribute: NSAttributedString.Key = .init(rawValue: "")

    attributedText.enumerateAttribute(attribute,
                                      in: range,
                                      options: .longestEffectiveRangeNotRequired) {attribute, range, pointer in

                                        let text = attributedText.attributedSubstring(from: range).string
                                        sections.append(text)
    }
    return sections
}

let text = "Fully <em>Furni</em>shed |Downtown and Canal Views"
print(getTextSections(text))

Result:

["Fully ", "Furni", "shed |Downtown and Canal Views"]

Not a regex, but to get all the word parts inside the tags, e.g. [Furni, sma]:

let text = "Fully <em>Furni<\\/em>shed <em>sma<\\/em>shed |Downtown and Canal Views"
let emphasizedParts = text.components(separatedBy: "<em>").filter { $0.contains("<\\/em>")}.compactMap { $0.components(separatedBy: "<\\/em>").first }

For the full words, e.g. [Furnished, smashed]:

let emphasizedParts = text.components(separatedBy: " ").filter { $0.contains("<em>")}.map { $0.replacingOccurrences(of: "<\\/em>", with: "").replacingOccurrences(of: "<em>", with: "") }

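Given this string:

let str = "Fully <em>Furni<\\/em>shed |Downtown and Canal Views"

and the corresponding NSRange:

let range = NSRange(location: 0, length: (str as NSString).length)

Let's construct a regular expression that matches the letters between <em> and </em>, or the letters that directly follow </em>: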

let regex = try NSRegularExpression(pattern: "(?<=<em>)\\w+(?=<\\\\/em>)|(?<=<\\\\/em>)\\w+")

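What it does (patterns shown as they appear in the Swift string literal):

  • Look for 1 or more letters: \\w+
  • preceded by <em>: (?<=<em>) (positive lookbehind),
  • and followed by <\\/em>: (?=<\\\\/em>) (positive lookahead),
  • or: |
  • letters: \\w+
  • preceded by <\\/em>: (?<=<\\\\/em>) (positive lookbehind)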

Let's get the matches:

let matches = regex.matches(in: str, range: range)

We can convert them into substrings:

let strings: [String] = matches.compactMap { match -> String? in
    // NSRange counts UTF-16 units, so convert it instead of offsetting by Characters
    guard let range = Range(match.range, in: str) else { return nil }
    return String(str[range])
}

Now we can join each string at an even index with the string at the following odd index:

let evenStride = stride(from: strings.startIndex,
               to: strings.index(strings.endIndex, offsetBy: -1),
               by: 2)
let result = evenStride.map { strings[$0] + strings[strings.index($0, offsetBy: 1)]}

print(result)  //["Furnished"]

We can test it with another string:

let str2 = "<em>Furni<\\/em>shed <em>balc<\\/em>ony <em>gard<\\/em>en"

The result will be:

["Furnished", "balcony", "garden"]

Here is a basic implementation in PHP (yes, I know you asked about Swift, but this is to demonstrate the regex part):

<?php

$in = "Fully <em>Furni</em>shed |Downtown and Canal Views";

$m = preg_match("/<([^>]+)>([^>]+)<\/\\1>([^ ]+|$)/i", $in, $t);    

$s = $t[2] . $t[3];

echo $s;

Output:

ZC-MGMT-04:~ jv$ php -q regex.php
Furnished

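Obviously the most important bit is the regex part, which matches any tag, finds the corresponding closing tag, and then captures the remainder of the word after it.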
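For readers who want the same idea in Swift, here is a rough adaptation rather than a literal port. It assumes NSRegularExpression, and it narrows two parts of the PHP pattern: the inner text is matched with [^<]+ and the trailing part with \S*.

```swift
import Foundation

let input = "Fully <em>Furni</em>shed |Downtown and Canal Views"

// <([^>]+)>  opening tag, its name captured as group 1
// ([^<]+)    text inside the tag (group 2)
// </\1>      closing tag whose name equals group 1 (backreference)
// (\S*)      rest of the word after the closing tag (group 3)
let pattern = "<([^>]+)>([^<]+)</\\1>(\\S*)"
let regex = try! NSRegularExpression(pattern: pattern, options: [.caseInsensitive])

var word: String? = nil
if let match = regex.firstMatch(in: input, range: NSRange(input.startIndex..., in: input)),
   let inner = Range(match.range(at: 2), in: input),
   let tail = Range(match.range(at: 3), in: input) {
    // Join the text inside the tag with the rest of the word.
    word = String(input[inner]) + String(input[tail])
}
print(word ?? "no match")
```

The backreference \1 is what ties the closing tag to the opening one, so the same pattern would also work for tags other than em.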

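If you just want to extract the text between <em> and <\\/em> (note that this is not a normal HTML tag, which would have been <em></em>), we can simply match this pattern and replace it with the value of capture group 1. We don't need to worry about what surrounds the matched text; we just replace the match with whatever was captured between the tags, which may even be an empty string, since the OP did not mention any constraint on that. The regex that matches this pattern is: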

<em>(.*?)<\\\/em>

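Or, to be technically more robust and handle optional spaces appearing anywhere inside the tags (as I saw someone point out in the comments on another answer), we can use this regex: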

<\s*em\s*>(.*?)<\s*\\\/em\s*>

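and replace the match with \1 or $1, depending on where you are doing this. Whether the tags contain an empty string or some actual text makes no difference, as my demo on regex101 shows.

Here is the demo.

Let me know if this meets your requirements, and tell me if anything is still missing.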
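In Swift, the replacement described above can be sketched as follows. This assumes String.replacingOccurrences(of:with:options:) with .regularExpression, where $1 in the template refers to capture group 1; the variable names are illustrative only.

```swift
import Foundation

let raw = "Fully <em>Furni<\\/em>shed |Downtown and Canal Views"

// The regex <\s*em\s*>(.*?)<\s*\\/em\s*> written as a Swift string literal.
let pattern = "<\\s*em\\s*>(.*?)<\\s*\\\\/em\\s*>"

// "$1" in the template stands for the text captured between the tags.
let cleaned = raw.replacingOccurrences(of: pattern, with: "$1", options: .regularExpression)
print(cleaned)
```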

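I strongly recommend using named regex capture groups.

  1. Create your regex, giving a name to each capture group you need:
let capturePattern = "(?<=<em>)(?<data1>\\w+)(?=<\\\\/em>)|(?<=<\\\\/em>)(?<data2>\\w+)"
  2. Now use the capture pattern in Swift to get the data: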
let captureRegex = try! NSRegularExpression(
    pattern: capturePattern,
    options: []
)

let textInput = "Fully <em>Furni<\\/em>shed |Downtown and Canal Views"
let textInputRange = NSRange(
    textInput.startIndex..<textInput.endIndex,
    in: textInput
)

let matches = captureRegex.matches(
    in: textInput,
    options: [],
    range: textInputRange
)

guard let match = matches.first else {
    // Handle exception
    throw NSError(domain: "", code: 0, userInfo: nil)
}

let data1Range = match.range(withName: "data1")

// Extract the substring matching the named capture group
if let substringRange = Range(data1Range, in: textInput) {
   let capture = String(textInput[substringRange])
   print(capture)
}

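The data2 group can be read the same way. Because the two named groups sit in different branches of the alternation, data2 is only set on the second match, not on the first:

if matches.count > 1 {
    let data2Range = matches[1].range(withName: "data2")

    if let substringRange = Range(data2Range, in: textInput) {
       let capture = String(textInput[substringRange])
       print(capture)
    }
}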

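The main advantage of this approach is that it does not depend on group indices, so the code is less tightly coupled to the exact shape of the regex.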
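Since each match fills only one of the two named groups, it can be convenient to loop over all matches and collect whichever names are present. A minimal sketch, assuming the same pattern and input as above; the dictionary name found is illustrative only:

```swift
import Foundation

let pattern = "(?<=<em>)(?<data1>\\w+)(?=<\\\\/em>)|(?<=<\\\\/em>)(?<data2>\\w+)"
let regex = try! NSRegularExpression(pattern: pattern)
let input = "Fully <em>Furni<\\/em>shed |Downtown and Canal Views"

var found: [String: String] = [:]
for match in regex.matches(in: input, range: NSRange(input.startIndex..., in: input)) {
    for name in ["data1", "data2"] {
        // A group that did not take part in this match has location NSNotFound,
        // for which Range(_:in:) returns nil.
        if let range = Range(match.range(withName: name), in: input) {
            found[name] = String(input[range])
        }
    }
}
print(found["data1"] ?? "", found["data2"] ?? "")
```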

