
Verifying and classifying input with a regular expression

Right now I take input like '1-1-AA' and split it at every '-'. But the input can vary, for example "chr1-1-CG", "3-1-CA", "CHRX-34-AT", etc.

The first position should accept "chr1, chr2, ..., chr23, chrX, chrY", the second position should accept only a positive number, and the third and fourth should each accept only one letter from {A,C,G,T}.

So I'm thinking about using re.findall and using the error cases to return a warning for incorrect input, but I'm not sure how to report errors with a regular expression.

Can anyone help?

import re


def _check_input(var_str):  # maybe better to check each input separately
    """
    Checks if the input is a valid variant string
    :param var_str: string supposed to be in the format 'chr-pos-ref-alt'
    :return: bool which tells whether the input is valid
    """
    pattern = re.compile(
        r"""([1-9]|[1][0-9]|[2][0-2]|[XY])  # the chromosome
                        -(\d+)     # the position
                        -[ACGT]+   # ref
                        -[ACGT]+  # alt""",
        re.X,
    )
    if re.fullmatch(pattern, var_str) is None:
        return False
    else:
        return True


def string_to_dict(inp):
    """
    Converts a variant string into a dictionary
    :param inp: string which should be a valid variant
    :return: dictionary with the variants keys and values
    """
    inp_list = inp.split("-")
    inp_dict = {
        "chr": inp_list[0],
        "pos": inp_list[1],
        "ref": inp_list[2],
        "alt": inp_list[3],
    }
    return inp_dict

A regex is great for checking the overall validity of a sequence. Unfortunately, I do not see how you can achieve error reporting using one single regex.

So I think you can use the regex to check the full validity of the input. If it is not valid, you can add some more code to tell the user what might be wrong.

import re


def _check_input(var_str):
    """
    Checks if the input is a valid variant string
    :param var_str: string supposed to be in the format 'chr-pos-ref-alt'
    :return: a match object
    :raises: ValueError on invalid input        
    """
    pattern = re.compile(
        r"(?:chr)?(?P<chr>[1-9]|[1][0-9]|[2][0-3]|[XY])"  # the chromosome
        r"-(?P<pos>\d+)"  # the position
        r"-(?P<ref>[ACGT])"  # ref
        r"-(?P<alt>[ACGT])",  # alt
        re.X | re.IGNORECASE,
    )
    # fullmatch (not match) so trailing text such as '1-1-A-TT' is rejected
    match = pattern.fullmatch(var_str)

    if not match:
        _input_error_suggestion(var_str)

    return match # you can access values like so match['chr'], match['pos'], match['ref'], match['alt']

def _input_error_suggestion(var_str):
    parts = var_str.split('-')

    if len(parts) != 4:
        raise ValueError('Input should have 4 parts separated by -')

    # 'chrom' rather than 'chr' to avoid shadowing the built-in chr()
    chrom, pos, nucleotide1, nucleotide2 = parts

    # check part 1 (fullmatch so that e.g. 'chr99' or 'chr1abc' is rejected)
    chr_pattern = re.compile(r'(?:chr)?([1-9]|[1][0-9]|[2][0-3]|[XY])', re.IGNORECASE)
    if not chr_pattern.fullmatch(chrom):
        raise ValueError('Input first part should be a chromosome chr1, chr2, ..., chr23, chrX, chrY')

    # check part 2
    try:
        p = int(pos)
    except ValueError:
        raise ValueError('Input second part should be an integer')
    if p < 0:
        raise ValueError('Input second part should be a positive integer')

    # check part 3 and 4
    for i, n in enumerate((nucleotide1, nucleotide2)):
        # exactly one letter, compared case-insensitively like the main regex
        if len(n) != 1 or n.upper() not in 'ACGT':
            raise ValueError(f"Input part {3 + i} should be one of {{A,C,G,T}}")

    # something else
    raise ValueError("Input was malformed, it should be in the format 'chr-pos-ref-alt'")
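
The function above stops at the first problem it finds. If you would rather report every bad field at once, a variation could look like the sketch below. It assumes the four-part 'chr-pos-ref-alt' format from the docstring; the names FIELD_RULES and collect_input_errors and the message texts are made up for this illustration.

```python
import re

# Hypothetical per-field rules: (field name, pattern, message).
# The patterns mirror the pieces of the main regex above.
FIELD_RULES = [
    ("chr", re.compile(r"(?:chr)?(?:[1-9]|1[0-9]|2[0-3]|[XY])", re.IGNORECASE),
     "should be a chromosome such as chr1, ..., chr23, chrX, chrY"),
    ("pos", re.compile(r"\d+"), "should contain digits only"),
    ("ref", re.compile(r"[ACGT]", re.IGNORECASE), "should be one of A, C, G, T"),
    ("alt", re.compile(r"[ACGT]", re.IGNORECASE), "should be one of A, C, G, T"),
]


def collect_input_errors(var_str):
    """Return a list of problems found in var_str; an empty list means valid."""
    parts = var_str.split("-")
    if len(parts) != len(FIELD_RULES):
        # Without the right number of parts the fields cannot be told apart.
        return [f"expected {len(FIELD_RULES)} parts separated by '-', got {len(parts)}"]

    errors = []
    for (name, pattern, message), value in zip(FIELD_RULES, parts):
        # fullmatch: the whole field must match, not just a prefix of it
        if pattern.fullmatch(value) is None:
            errors.append(f"{name} {value!r} {message}")
    return errors
```

For example, collect_input_errors("chr99-x-A-TT") returns three messages (for chr, pos and alt), while a valid string such as "chr1-1-C-G" returns an empty list.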

Side notes:

I improved the original regex by

  • adding the optional "chr",
  • naming the groups,
  • allowing one and only one letter per nucleotide,
  • fixing the missing chromosome 23 and
  • making it case-insensitive.
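
With named groups, the string_to_dict step from the question can probably be folded into the validation itself, since a match object exposes its named groups through groupdict(). Here is a minimal sketch of that idea; VARIANT_RE and variant_to_dict are names invented for this example, and upper-casing the fields and converting pos to int are my own assumptions about what is convenient downstream.

```python
import re

# Same pattern as in the answer, compiled once at module level.
VARIANT_RE = re.compile(
    r"(?:chr)?(?P<chr>[1-9]|1[0-9]|2[0-3]|[XY])"  # the chromosome
    r"-(?P<pos>\d+)"                              # the position
    r"-(?P<ref>[ACGT])"                           # ref
    r"-(?P<alt>[ACGT])",                          # alt
    re.IGNORECASE,
)


def variant_to_dict(var_str):
    """Validate var_str and return its fields as a normalised dictionary."""
    match = VARIANT_RE.fullmatch(var_str)
    if match is None:
        raise ValueError(f"{var_str!r} is not in the format 'chr-pos-ref-alt'")
    fields = match.groupdict()  # {'chr': ..., 'pos': ..., 'ref': ..., 'alt': ...}
    return {
        "chr": fields["chr"].upper(),  # the optional 'chr' prefix is not captured
        "pos": int(fields["pos"]),     # assumption: an int is more useful than a str
        "ref": fields["ref"].upper(),
        "alt": fields["alt"].upper(),
    }


print(variant_to_dict("chrx-34-a-t"))
# {'chr': 'X', 'pos': 34, 'ref': 'A', 'alt': 'T'}
```

Because the case-insensitive match accepts "chrx" as well as "CHRX", normalising after the match keeps the rest of the code from having to deal with mixed case.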


 