
How to get Nth number from the specific position of a number in a string - Regex

288007  327920  374740 000368   044575  082865 680798
717374  755879  811106  855460  920577  953515  996819 ......

I have a string containing thousands of 6-digit numbers, and I want to extract a given count of numbers that follow the Nth number, using a regular expression.

Let's say I need to extract three numbers after the 4th number; then the result should be 044575 082865 680798.

As another example, if I need to extract 2 numbers after the 10th number, then the result should be 855460 920577.

I don't know whether this is possible with regex; I think a For Each statement may be usable in my case.

I am only able to extract each six-digit number with the code below.

Dim NumberMatchCollection As MatchCollection = Regex.Matches("String containing numbers", "(?<!\d)\d{6}(?!\d)")
For Each NumberMatch As Match In NumberMatchCollection

   Dim ItemNumber As String = NumberMatch.Value

Next

Edited: I cannot guarantee that every separator will be a single space, a double space, a tab, or something else. I can only guarantee that each number is always 6 digits long and that the numbers are separated by one or more spaces or tabs.

Wouldn't this be simpler using maths?

Three numbers after the 4th number is the substring that starts at character 7 * 4 and is 7 * 3 characters long.

To expand on my comment: this assumes that the actual data is divided equally.

If each number has 6 digits with a single space in between, then the first number after the 4th starts at zero-based offset (6+1)*4, and if you want 3 numbers you just need to fetch (6+1)*3 characters (which includes the delimiter after the last number).

    Dim str As String = "288007 327920 374740 000368 044575 082865 680798 717374 755879 811106 855460 920577 953515 996819"

    Dim startingNumber As Integer = 4
    Dim amountToFetch As Integer = 3

    ' 7 = [size of each number] + [delimiter length]
    ' 7 = 6 + 1
    ' Subtract 1 so the delimiter after the last number is not included; this also
    ' keeps the length in range when the last number fetched ends the string.

    Console.WriteLine(str.Substring(7 * startingNumber, 7 * amountToFetch - 1))
    Console.ReadLine()
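
The same offset arithmetic can be wrapped in a small helper that also copes with a request running past the end of the string. This is only a sketch in C#, and it keeps the answer's assumption of exactly one delimiter character between numbers; FixedWidth.Slice and its parameter names are hypothetical, not part of any library.

```csharp
using System;

static class FixedWidth
{
    // Assumes every number is `width` characters long and every delimiter is
    // exactly `delimiterLength` characters (6 and 1 in the answer above).
    public static string Slice(string s, int skip, int count, int width = 6, int delimiterLength = 1)
    {
        int stride = width + delimiterLength;
        int start = stride * skip;
        if (count <= 0 || start >= s.Length) return string.Empty;

        // Leave out the delimiter after the last requested number, and clamp the
        // length so a request that runs past the end of the string does not throw.
        int length = Math.Min(stride * count - delimiterLength, s.Length - start);
        return s.Substring(start, length);
    }
}

static class Program
{
    static void Main()
    {
        string s = "288007 327920 374740 000368 044575 082865 680798 " +
                   "717374 755879 811106 855460 920577 953515 996819";

        Console.WriteLine(FixedWidth.Slice(s, 4, 3));   // 044575 082865 680798
        Console.WriteLine(FixedWidth.Slice(s, 12, 5));  // 953515 996819 (clamped at the end)
    }
}
```

If the separators can vary in length, as the question's edit says, this arithmetic no longer applies and one of the token-based approaches is needed instead.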

You can .Split() the string and use LINQ extension methods on the resulting array:

using System;
using System.Linq;
using System.Text;

// some test data...
var rand = new Random();
StringBuilder sb = new StringBuilder();
for (int i = 1; i <= 10000; i++)
{
    sb.Append(i.ToString("000000") + ((rand.Next(5)==1) ? "  ": "\t"));
}
string s = sb.ToString();

string portion = string.Join("  ", s.Split(new [] {' ', '\t'}, StringSplitOptions.RemoveEmptyEntries).Skip(10).Take(3));

Console.WriteLine(portion); // outputs "000011  000012  000013"

Note: Skip takes the count of numbers to skip, so to start from the first number you would use .Skip(0).

But if your string is in the rigid format you show (assuming the variable numbers of spaces are typos, thanks @ErikE), Coenraad's method of calculating where the required substring starts and how many characters to take would be more efficient. I'll leave it to Coenraad to expand on that answer, as it would not be fair to possibly take the points.
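
If the tokens should also be checked to be exactly six digits, the same Skip/Take idea can be applied to the matches of the pattern from the question instead of to the result of Split. A minimal sketch in C#; the class and method names (MatchSlice, Slice) are hypothetical:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class MatchSlice
{
    // Returns the `count` six-digit numbers that follow the first `skip` numbers.
    public static string[] Slice(string input, int skip, int count)
    {
        return Regex.Matches(input, @"(?<!\d)\d{6}(?!\d)")
                    .Cast<Match>()   // MatchCollection is non-generic on .NET Framework
                    .Skip(skip)
                    .Take(count)
                    .Select(m => m.Value)
                    .ToArray();
    }
}

static class Program
{
    static void Main()
    {
        // Separators of varying width, including a line break, as in the question.
        string s = "288007  327920  374740 000368   044575  082865 680798\n" +
                   "717374  755879  811106  855460  920577  953515  996819";

        Console.WriteLine(string.Join(" ", MatchSlice.Slice(s, 4, 3)));   // 044575 082865 680798
        Console.WriteLine(string.Join(" ", MatchSlice.Slice(s, 10, 2)));  // 855460 920577
    }
}
```

Because Skip and Take simply stop at the end of the sequence, asking for more numbers than remain returns the ones that exist rather than throwing.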

I tried and tried to make the regex method be consistently fast, but I found it depended strongly on which numbers you want to retrieve:

[Chart: time taken against the first item to retrieve, with one line each for String.Split and RegEx]

For anyone wanting to test that, I put a default Chart on a Form and used this code:

Imports System.Text
Imports System.Text.RegularExpressions
Imports System.Windows.Forms.DataVisualization
Imports System.Windows.Forms.DataVisualization.Charting

Public Class Form1

    Sub DoStuff()

        Dim ser1 As New Series With {.Name = "String.Split"}
        Dim ser2 As New Series With {.Name = "RegEx"}

        Dim sb As New StringBuilder()

        For i As Integer = 1 To 10000
            sb.Append(i.ToString("000000") + "  ")
        Next
        Dim s As String = sb.ToString()

        Dim sw As New Stopwatch()

        Dim itemsToTake As Integer = 50

        For firstItem = 1 To 9000 Step 100

            sw.Restart()

            Dim portion As String = String.Join(" ", s.Split({" "c}, StringSplitOptions.RemoveEmptyEntries).Skip(firstItem - 1).Take(itemsToTake))

            sw.Stop()
            ser1.Points.AddXY(firstItem - 1, sw.ElapsedTicks)

            Dim pattern = "^(?:\d+\s+){" + (firstItem - 1).ToString() + "}((\d+)\s+){" + itemsToTake.ToString() + "}"
            Dim re = New Regex(pattern)

            sw.Restart()
            Dim matches = re.Matches(s)
            Dim cs = matches(0).Groups(0).Captures
            sw.Stop()
            ser2.Points.AddXY(firstItem - 1, sw.ElapsedTicks)

        Next

        Chart1.Series.Clear()
        Chart1.Series.Add(ser1)
        Chart1.Series(0).ChartType = SeriesChartType.Line
        Chart1.Series.Add(ser2)
        Chart1.Series(1).ChartType = SeriesChartType.Line

        Chart1.ChartAreas(0).AxisX.IsMarginVisible = False
        Chart1.ChartAreas(0).AxisX.Title = "First item to retrieve"
        Chart1.ChartAreas(0).AxisY.Title = "Time taken"

    End Sub

    Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
        DoStuff()

    End Sub

End Class

If you would like a regex and C# solution, the following code handles the example of 3 numbers after the 4th number.

        var st = @"288007  327920  374740 000368   044575  082865 680798
                  717374  755879  811106  855460  920577  953515  996819";
        var pattern = @"^(\d+\s+){4}((?<x>\d+)\s+){3}";
        var matches = Regex.Matches(st,pattern,RegexOptions.Singleline);
        foreach (Capture m in matches[0].Groups["x"].Captures)
            Console.WriteLine("value={0}", m.Value);

(Edit: removed one group per comment below)
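
For arbitrary positions, the two repeat counts can be substituted into the pattern at run time. The following is a sketch rather than a definitive version: RegexSlice.Slice is a hypothetical helper, and two details differ from the pattern above by assumption, namely that each number is matched as exactly six digits and that (?:\s+|$) is used so the last requested number may sit at the very end of the input.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class RegexSlice
{
    public static string[] Slice(string input, int skip, int count)
    {
        // The non-capturing prefix consumes `skip` numbers. The named group x is
        // captured once per repetition, so its Captures hold all `count` numbers.
        string pattern = @"^\s*(?:\d{6}\s+){" + skip + @"}(?:(?<x>\d{6})(?:\s+|$)){" + count + "}";

        Match m = Regex.Match(input, pattern);
        if (!m.Success) return new string[0];   // fewer than skip + count numbers available

        return m.Groups["x"].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
    }
}

static class Program
{
    static void Main()
    {
        string s = "288007  327920  374740 000368   044575  082865 680798\n" +
                   "717374  755879  811106  855460  920577  953515  996819";

        Console.WriteLine(string.Join(" ", RegexSlice.Slice(s, 4, 3)));   // 044575 082865 680798
        Console.WriteLine(string.Join(" ", RegexSlice.Slice(s, 12, 2)));  // 953515 996819
    }
}
```

Unlike the Skip/Take approaches, this pattern matches all or nothing: if fewer than skip + count numbers are present, the match fails and the sketch returns an empty array.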
