How can I parse two fields out of a string using a regular expression in Python?

Question

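I'm trying to figure out how to use a regular expression to parse fields out of a naming scheme. Basically, I want a way to take a query string and extract patterns from it according to a naming scheme. In this case there are two patterns to extract, ID and DIRECTION.

DIRECTION will always be 1 or 2.

ID can be any string the file system allows (e.g. alphanumeric characters, -, _ and .).

Here is the basic skeleton I'm trying to write: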

def function(query:str, naming_scheme:str):
    # stuff
    return (ID, DIRECTION)

Here are queries against naming scheme 1 (naming_scheme_1):

naming_scheme_1 = "[ID]_R[DIRECTION].fastq.gz"
ID, DIRECTION = function("Kuwait_110_S59_R1.fastq.gz", naming_scheme_1)
#ID = "Kuwait_110_S59"
#DIRECTION = "1"

ID, DIRECTION = function("Kuwait_110_S59_R2.fastq.gz", naming_scheme_1)
#ID = "Kuwait_110_S59"
#DIRECTION = "2"

Here are queries against naming scheme 2 (naming_scheme_2):

naming_scheme_2 = "[ID]_R[DIRECTION]_001.fastq.gz"
ID, DIRECTION = function("Kuwait_110_S59_R1_001.fastq.gz", naming_scheme_2)
#ID = "Kuwait_110_S59"
#DIRECTION = "1"

ID, DIRECTION = function("Kuwait_110_S59_R2_001.fastq.gz", naming_scheme_2)
#ID = "Kuwait_110_S59"
#DIRECTION = "2"

Here are queries against naming scheme 3 (naming_scheme_3):

naming_scheme_3 = "barcode-Kuwait_110_S59_1.fq"

ID, DIRECTION = function("barcode-Kuwait_110_S59_1.fq", naming_scheme_3)
#ID = "Kuwait_110_S59"
#DIRECTION = "1"

ID, DIRECTION = function("barcode-Kuwait_110_S59_2.fq", naming_scheme_3)
#ID = "Kuwait_110_S59"
#DIRECTION = "2"

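How can I use a regular expression (or something similar) in Python to parse out the fields in this case?

My current approach is a series of split operations on the string, which doesn't seem ideal.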

Answer 1

If your third naming scheme is actually

naming_scheme_3 = "barcode-[ID]_[DIRECTION].fq"

then the Python code

import re

def get_id_and_direction(query: str):
    # raw string, so that \. reaches the regex engine unchanged
    matcher = re.match(r"^(?:barcode-)?(?P<ID>[a-zA-Z0-9._-]+)_R?(?P<DIRECTION>[12])(?:\.fq|(?:_001)?\.fastq\.gz)$", query)
    if matcher:
        return (matcher.group('ID'), matcher.group('DIRECTION'))
    else:
        return ( None, None )

print(get_id_and_direction('Kuwait_110_S59_R1.fastq.gz'))
print(get_id_and_direction('Kuwait_110_S59_R2.fastq.gz'))
print(get_id_and_direction('Kuwait_110_S59_R1_001.fastq.gz'))
print(get_id_and_direction('Kuwait_110_S59_R2_001.fastq.gz'))
print(get_id_and_direction('barcode-Kuwait_110_S59_1.fq'))
print(get_id_and_direction('barcode-Kuwait_110_S59_2.fq'))

gives you the ID and DIRECTION for all 3 naming schemes at once:

('Kuwait_110_S59', '1')
('Kuwait_110_S59', '2')
('Kuwait_110_S59', '1')
('Kuwait_110_S59', '2')
('Kuwait_110_S59', '1')
('Kuwait_110_S59', '2')

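The regular expression "^(?:barcode-)?(?P<ID>[a-zA-Z0-9._-]+)_R?(?P<DIRECTION>[12])(?:\.fq|(?:_001)?\.fastq\.gz)$" works as follows:

^(?:barcode-)? looks for an optional "barcode-" at the start; the ? at the end makes the whole group optional.

(?P<ID>[a-zA-Z0-9._-]+) is the (named) group that captures the ID, made up of one or more alphanumeric characters or ".", "_", "-".

_R? matches either _R or just _ (the ? after the R makes the R optional); it always follows the ID.

(?P<DIRECTION>[12]) is the (named) group that picks up 1 or 2, the direction.

(?:\.fq|(?:_001)?\.fastq\.gz)$ ensures the string ends with '.fq', '_001.fastq.gz' or '.fastq.gz', the three possible endings in your 3 naming schemes.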
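
To make the breakdown above easier to check, here is a minimal sketch of the same pattern written with re.VERBOSE, so each piece sits next to its comment. The names FILENAME_RE and parse_filename are only illustrative and are not part of the code above.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so each part carries a comment.
FILENAME_RE = re.compile(
    r"""
    ^(?:barcode-)?                  # optional "barcode-" prefix
    (?P<ID>[a-zA-Z0-9._-]+)         # ID: letters, digits, ".", "_" or "-"
    _R?                             # "_" or "_R" between ID and direction
    (?P<DIRECTION>[12])             # direction: 1 or 2
    (?:\.fq|(?:_001)?\.fastq\.gz)$  # one of the three known endings
    """,
    re.VERBOSE,
)

def parse_filename(query: str):
    # Returns (ID, DIRECTION), or (None, None) when the name does not fit.
    m = FILENAME_RE.match(query)
    if m is None:
        return (None, None)
    return (m["ID"], m["DIRECTION"])

print(parse_filename("Kuwait_110_S59_R1_001.fastq.gz"))
print(parse_filename("barcode-Kuwait_110_S59_2.fq"))
```

A name with a direction other than 1 or 2, such as 'Kuwait_110_S59_R3.fastq.gz', falls through to (None, None).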

See the code in action here: https://onlinegdb.com/yD8WBaPNt

Hope this gets you going!

Answer 2

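The reason the commenters asked for parameters is that you haven't given any rules.

For example, does ID always consist of "string_3 characters_3 characters"?

Is direction always a single character? Or is it something more elaborate? Is it always a number?

I have provided an answer, but without enough parameters it may not help you much. If the assumptions I outline in the code comments are correct, this will work well. That said, if it doesn't work, please share some rules that your strings must follow.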

import re

str1 = "Kuwait_110_S59_R1.fastq.gz"
str2 = "Kuwait_110_S59_R1_001.fastq.gz"
str3 = "barcode-Kuwait_110_S59_1.fq"
str4 = "bar-Kuwait Kuwait_295_235_622.fg"

# this assumes
#   the first char that matters for ID is always capitalized
#   always 3 characters between the 1st & 2nd underscore & after the 2nd underscore
#   that direction is always a single character

def gimme(s):
  # DIRECTION: the single char before the first period
  # (for names with a "_001" suffix this is the trailing 1 of "001", not the read direction)
  DIRECTION = re.search(r"(.)(?=\.)", s).group(1)
  # ID: a capital letter, then *_3_3, followed by an underscore
  ID = re.search(r"([A-Z].*_.{3}_.{3})(?=_)", s).group(1)
  return (ID, DIRECTION)

s1 = gimme(str1)
s2 = gimme(str2)
s3 = gimme(str3)
s4 = gimme(str4)

print(s1)
# ('Kuwait_110_S59', '1')
print(s2)
# ('Kuwait_110_S59', '1')
print(s3)
# ('Kuwait_110_S59', '1')
print(s4)
# ('Kuwait Kuwait_295_235', '2')

Answer 3

Here is the code:

import re

def repl(match_object):
    inside_bracket = match_object.group(1)
    if inside_bracket == "DIRECTION":
        return r"(?P<DIRECTION>[12])"
    if inside_bracket == "ID":
        return r"(?P<ID>[-.\w]+)"

def function(query: str, naming_scheme: str):
    pattern = re.sub(r"\[(.*?)\]", repl, naming_scheme)
    match = re.match(pattern, query)
    return match["ID"], match["DIRECTION"]

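Explanation:

The most important part is converting your template into a regex pattern. By that I mean:

[ID]_R[DIRECTION].fastq.gz   -->  (?P<ID>[-.\w]+)_R(?P<DIRECTION>[12]).fastq.gz

This is done with the help of the repl function passed to re.sub. In that call I used \[(.*?)\] as the pattern, which matches the brackets and captures what is inside them. When building the pattern I applied your rules for DIRECTION and ID: [DIRECTION] becomes the named group (?P<DIRECTION>[12]), which accepts only 1 and 2, and [ID] becomes (?P<ID>[-.\w]+) for the file name (assuming file names contain no spaces).

That's it. Now you have a pattern with two named groups, ID and DIRECTION. They can be retrieved with match["ID"] and match["DIRECTION"].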
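
As a rough illustration of the conversion described above, the sketch below prints the pattern that re.sub produces for one template and then reads both groups at once with groupdict(). The FRAGMENTS table and the template_to_pattern helper are hypothetical names introduced for this example; they stand in for the repl function.

```python
import re

# Hypothetical lookup table: placeholder name -> regex fragment (same rules as above).
FRAGMENTS = {
    "ID": r"(?P<ID>[-.\w]+)",
    "DIRECTION": r"(?P<DIRECTION>[12])",
}

def template_to_pattern(naming_scheme: str) -> str:
    # Replace every [NAME] placeholder with its named-group fragment.
    # A function used as the replacement returns the text verbatim,
    # so the backslash in \w is not reinterpreted.
    return re.sub(r"\[(.*?)\]", lambda m: FRAGMENTS[m.group(1)], naming_scheme)

pattern = template_to_pattern("[ID]_R[DIRECTION]_001.fastq.gz")
print(pattern)

match = re.match(pattern, "Kuwait_110_S59_R2_001.fastq.gz")
print(match.groupdict())
```

groupdict() returns a dictionary keyed by group name, which is handy if more placeholders are added later.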

Here is a test:

ID, DIRECTION = function("Kuwait_110_S59_R1.fastq.gz", "[ID]_R[DIRECTION].fastq.gz")
print(ID, DIRECTION)

Output:

Kuwait_110_S59 1

Note: I only considered the happy path. Don't forget to raise an exception if your template (or the query) is malformed.
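
One possible way to cover the unhappy cases, sketched under a few assumptions: a template must contain [ID] and [DIRECTION] exactly once each, and a query that does not fit should raise ValueError. The function name parse_with_scheme and the FRAGMENTS table are hypothetical. The sketch also passes the literal parts of the template through re.escape, so the dots in .fastq.gz only match real dots, and uses re.fullmatch so that trailing text is rejected.

```python
import re

# Hypothetical lookup table: placeholder name -> regex fragment.
FRAGMENTS = {
    "ID": r"(?P<ID>[-.\w]+)",
    "DIRECTION": r"(?P<DIRECTION>[12])",
}

def parse_with_scheme(query: str, naming_scheme: str):
    # With one capturing group, re.split alternates literal text (even
    # indices) and placeholder names (odd indices).
    parts = re.split(r"\[(.*?)\]", naming_scheme)
    if sorted(parts[1::2]) != ["DIRECTION", "ID"]:
        raise ValueError(
            f"{naming_scheme!r} must contain [ID] and [DIRECTION] exactly once"
        )
    pieces = [
        re.escape(part) if i % 2 == 0 else FRAGMENTS[part]
        for i, part in enumerate(parts)
    ]
    match = re.fullmatch("".join(pieces), query)
    if match is None:
        raise ValueError(f"{query!r} does not follow {naming_scheme!r}")
    return match["ID"], match["DIRECTION"]

print(parse_with_scheme("Kuwait_110_S59_R1.fastq.gz", "[ID]_R[DIRECTION].fastq.gz"))
print(parse_with_scheme("barcode-Kuwait_110_S59_2.fq", "barcode-[ID]_[DIRECTION].fq"))
```

Raising instead of returning (None, None) is a design choice; callers that prefer a sentinel can catch ValueError and return whatever suits them.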
