Matching two strings using Regex in C#

I have two files, each containing one number per line. Let's call them File A and File B. Each file has roughly 2 million lines, and I'm trying to match those numbers using Regex.

The tricky part is that each number in B can either be equal to a number in A, or be a number in A followed by "00".

B = A OR B = A + "00"

This is how I define my Regex:

Regex re = new Regex("[0-9](0)*");

My code stores the smaller file in a HashSet, then I build my Regex and do the matching:

HashSet<string> matchVal = new HashSet<string>();

foreach (var mdeSubid in File.ReadLines("FileA.txt"))
{
    matchVal.Add(mdeSubid);
}

Regex re = new Regex("[0-9](0)*");
StreamWriter sw = new StreamWriter("output.txt");

foreach (var rx in File.ReadLines("FileB.txt"))
{
    var matches = re.Matches(rx);
    foreach (Match m in matches)
    {
        if (matchVal.Contains(m))
        {
            sw.WriteLine(m);
        }
    }
}
Console.WriteLine("DONE!");
Console.ReadLine();

When I run my program I don't get any matches, even though I'm 100% sure there are a bunch of them.

Real Example

File A:

846535465
846536589
8465631

File B:

84653546500
846536589
846563100
846563102

In the example, B should match A in all cases except the last one:

846535465 == 84653546500
846536589 == 846536589
8465631 == 846563100
8465631 != 846563102

The issue is your regex. What you're asking for is a single digit followed by as many zeroes as possible.

[0-9] - any single digit

(0)* - a capture group containing the digit zero, repeated as many times as possible (including not at all)
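To make this concrete, it can help to list what the pattern actually matches on one of the sample lines. The following is only a minimal sketch; the names `PatternDemo` and `MatchesOf` are made up for illustration and do not appear in the question.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // Returns every substring that the pattern matches in the input, in order.
    public static List<string> MatchesOf(string pattern, string input)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }
}

// Usage: string.Join(",", PatternDemo.MatchesOf("[0-9](0)*", "846563100"))
// yields "8,4,6,5,6,3,100": one fragment per digit, never the whole number.
```

None of these fragments equals a full line from File A, so the HashSet lookup cannot succeed. Note also that the lookup needs the matched text (`m.Value`), not the `Match` object itself.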

I believe what you really want is to match only the first part of the sequence, excluding the trailing zeroes, against a predefined list of numbers. I also believe you should anchor the match to the start of the line, in case a match elsewhere on the line causes a false positive.

Therefore, your regex should be: ^(\\d+)(00)?

^ - assert position at the start of the string

(\\d+) - first capture group, match any digit one or more times

(00)? - second capture group, match '00' zero or one times

Finally, you need to alter your code so that, rather than simply checking for a match, you check whether Groups[1] (the sequence minus the trailing zeroes) is in your set.

EDIT

So I tried this regex out on www.regex101.com and it's not quite right. The first capture group is asked to match as many digits as possible, which by definition already swallows the "00" meant for the second capture group. I'd suggest using only the first part of the regex, ^\\d+, and doing the trailing-zero check in code. Although that'll potentially double the time taken checking all those matches, I'm struggling to think of a more efficient method right now...
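One way the trailing-zero check in code might look, sketched under the assumption that all of File A fits in memory as a `HashSet<string>`: take the leading digits with `^\d+`, look them up unchanged, and if they end in "00" look up the shortened form as well. `SuffixMatcher` and `FindMatch` are illustrative names, not part of the original post.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SuffixMatcher
{
    // Leading run of digits on a line; anything after it is ignored.
    private static readonly Regex LeadingDigits = new Regex(@"^\d+");

    // Returns the value from `known` that the line corresponds to
    // (B == A, or B == A + "00"), or null when there is none.
    public static string FindMatch(HashSet<string> known, string line)
    {
        Match m = LeadingDigits.Match(line);
        if (!m.Success)
        {
            return null;
        }

        string digits = m.Value;

        // Case B == A.
        if (known.Contains(digits))
        {
            return digits;
        }

        // Case B == A + "00": drop the last two characters and retry.
        if (digits.Length > 2 && digits.EndsWith("00", StringComparison.Ordinal))
        {
            string trimmed = digits.Substring(0, digits.Length - 2);
            if (known.Contains(trimmed))
            {
                return trimmed;
            }
        }

        return null;
    }
}
```

With the sample data from the question, "84653546500" and "846563100" resolve to "846535465" and "8465631", "846536589" resolves to itself, and "846563102" gives null. Each line of File B costs at most two hash lookups.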

This is far from ideal, but I gave it a shot. It requires reading all the lines of FileA first, which isn't great if there are millions of lines, but I can't see a way around that which wouldn't slow the process down tremendously.

Anyway, here's the commented code, which simply prints the matches to the console.

//Requires: using System; using System.Collections.Generic; using System.IO; using System.Linq;
//PathFileA and PathFileB are placeholders for the paths of the two files.

//Read all the lines from FileA
var FileAContents = File.ReadAllLines(PathFileA);

//Read FileB one line at a time, and compare with FileA to search for matches.
using (StreamReader file = new StreamReader(PathFileB))
{
    string line;
    while ((line = file.ReadLine()) != null)
    {
        //Start by finding all the possible matches, where the line from FileB starts with something that can be found in FileA.
        //Note that this scans all of FileA for every line of FileB.
        List<string> possibleMatches = FileAContents.Where(m => line.StartsWith(m)).ToList();
        foreach (string pm in possibleMatches)
        {
            //If the lines are completely equal, you've found a match.
            if (pm.Equals(line))
            {
                Console.WriteLine("Match found, FileA: {0}, FileB: {1}", pm, line);
            }
            else if (line.EndsWith("00"))
            {
                //Remove the "00", then check if you've found a match.
                string tempLine = line.Substring(0, line.Length - 2);
                if (pm.Equals(tempLine))
                {
                    Console.WriteLine("Match found, FileA: {0}, FileB: {1}", pm, line);
                }
            }
        }
    }
}

The problem you have here is that you're matching against a very large number of values, so you may run low on memory. Here are two possible approaches.

The first is the simplest, but may use too much memory:

var words = new HashSet<string>(
    File
        .ReadLines("FileA.txt")
        .SelectMany(x => new[] { x, x + "00" }));

var query = from word in File.ReadLines("FileB.txt")
            where words.Contains(word)
            select word;

File.WriteAllLines("output.txt", query);

The second uses a structure called a "trie", which can store the numbers from the first file far more compactly. How much it helps depends on how random the numbers are.

void Main()
{
    var words = File.ReadLines("FileA.txt").SelectMany(x => new[] { x, x + "00" });

    var trie = new Trie(words);

    var query = from word in File.ReadLines("FileB.txt")
                where trie.Contains(word)
                select word;

    File.WriteAllLines("output.txt", query);
}

public class Trie : Dictionary<char, Trie>
{
    public Trie() : base(0) { }
    public Trie(IEnumerable<string> words) : base(0)
    {
        foreach (var word in words)
        {
            this.Add(word);
        }
    }

    public void Add(string word)
    {
        if (String.IsNullOrEmpty(word))
        {
            this[char.MinValue] = null;
        }
        else
        {
            Trie t = null;
            if (this.ContainsKey(word[0]))
            {
                t = this[word[0]];
            }
            else
            {
                t = new Trie();
                this[word[0]] = t;
            }
            t.Add(word.Substring(1));
        }
    }

    public bool Contains(string prefix)
    {
        return this.ContainsInternal(prefix + char.MinValue);
    }

    private bool ContainsInternal(string prefix)
    {
        if (!string.IsNullOrEmpty(prefix) && this.ContainsKey(prefix[0]))
        {
            return prefix.Length == 1 || this[prefix[0]].ContainsInternal(prefix.Substring(1));
        }
        return false;
    }
}

Both versions work in my tests.

Please try the code below (it uses System.Linq and System.Text.RegularExpressions):

var fileA = new string[] {"846535465", "846536589", "8465631"};
var fileB = new string[] {"84653546500", "846536589", "846563100", "846563102"};

foreach (var bText in fileB){
    var aText = fileA.FirstOrDefault(a => Regex.Match(bText, ("^(" + a + ")(00)?$")).Success);
    if(!String.IsNullOrEmpty(aText)){
        Console.WriteLine("B: " + bText + " - A: " + aText);
    }       
}

In this case, you check whether the line in file B matches any value from file A, either with or without the trailing 00.

In the above example, the fileA and fileB arrays stand in for your file contents. I simplified them to arrays so you can test it quickly.

EDIT: This example may not be the best when it comes to performance, but it should give you the idea of how to build the Regex dynamically.
