Extract multiple substrings in the same line

I'm trying to build a log parser, but I'm stuck. Right now my program goes through multiple files in a directory and reads each file line by line. I was able to find the substring I was looking for, "fct=", and extract the value next to the "=" using a delimiter, but I noticed that when a line contains more than one "fct=", it doesn't see the others.

So I restarted my code and found a way to get the index positions of all occurrences of "fct=" in the same line, using an extension method that puts the indexes in a list, but I don't see how I can use this list to get the value next to the "=" with my delimiters.

How can I extract the value next to the "=", knowing the start position of "fct=" and the delimiter at the end of the wanted value?

I'm just starting out in C#, so let me know if I can give you more information. Thanks,

Here's an example of what I would like to parse:

<dat>FCT=10019,XN=KEY,CN=ROHWEPJQSKAUMDUC FCT=666</dat></logurl>
<dat>XN=KEY,CN=RTU FCT=4515</dat></logurl>
<dat>XN=KEY,CN=RT</dat></logurl>

I would like to retrieve 10019, 666 and 4515.

namespace LogParserV1
{
class Program
{

    static void Main(string[] args)
    {

        int counter = 0;
        string[] dirs = Directory.GetFiles(@"C:/LogParser/LogParserV1", "*.txt");
        string fctnumber;
        char[] enddelimiter = { '<', ',', '&', ':', ' ', '\\', '\'' };

        foreach (string fileName in dirs)
        {
            StreamReader sr = new StreamReader(fileName);

            {
                String lineRead;
                while ((lineRead = sr.ReadLine()) != null)
                {

                    if (lineRead.Contains("fct="))
                    {
                        List<int> list = MyExtensions.GetPositions(lineRead, "fct");
                        //int start = lineRead.IndexOf("fct=") + 4;
                       // int end = lineRead.IndexOfAny(enddelimiter, start);
                        //string result = lineRead.Substring(start, end - start);

                        //fctnumber = result; // 'result' only exists in the commented-out lines above

                        //System.Console.WriteLine(fctnumber);
                        list.ForEach(Console.WriteLine);
                    }
                    // affiche tout les ligne System.Console.WriteLine(lineRead);
                    counter++;
                }
                System.Console.WriteLine(fileName);

                sr.Close();
            }
        }

        // Suspend the screen.  
        System.Console.ReadLine();

    }
}
}


namespace ExtensionMethods
{
public  class MyExtensions
{
    public static List<int> GetPositions(string source, string searchString)
    {
        List<int> ret = new List<int>();
        int len = searchString.Length;
        int start = -len;
        while (true)
        {
            start = source.IndexOf(searchString, start + len);
            if (start == -1)
            {
                break;
            }
            else
            {
                ret.Add(start);
            }
        }
        return ret;
    }
    }
}

You could simplify your code a lot by using Regex pattern matching instead.

The following pattern, (?<=FCT=)[0-9]*, matches any run of digits preceded by FCT= (case-sensitive, as written).


This enables us to do the following:

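using System.Linq;
using System.Text.RegularExpressions;
string input = "<dat>FCT=10019,XN=KEY,CN=ROHWEPJQSKAUMDUC FCT=666</dat></logurl>...";
string pattern = "(?<=FCT=)[0-9]*";
// Each match is the run of digits that directly follows an "FCT=".
var values = Regex.Matches(input, pattern).Cast<Match>().Select(x => x.Value);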
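
To use this inside the line-by-line loop from the question, the pattern can be wrapped in a small helper. This is only a minimal sketch: the names `FctExtractor` and `ExtractFctValues` are made up for illustration, and it assumes the values are purely numeric and that "fct=" may appear in either case, so it uses `RegexOptions.IgnoreCase` and `[0-9]+` (which skips an "fct=" that has no digits after it).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class FctExtractor
{
    // Assumption: values are digits only, and the prefix may be "fct=" or "FCT=".
    private static readonly Regex FctPattern =
        new Regex(@"(?<=fct=)[0-9]+", RegexOptions.IgnoreCase);

    // Returns every numeric value that directly follows "fct=" in one line.
    public static List<string> ExtractFctValues(string line)
    {
        return FctPattern.Matches(line)
                         .Cast<Match>()
                         .Select(m => m.Value)
                         .ToList();
    }
}
```

Inside `while ((lineRead = sr.ReadLine()) != null)` you would then call `FctExtractor.ExtractFctValues(lineRead)` and print or store whatever it returns; a line without any "fct=" simply gives an empty list.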

I have tested this solution with your data, and it gives me the expected results (10019, 666 and 4515):

string data = @"<dat>FCT=10019,XN=KEY,CN=ROHWEPJQSKAUMDUC FCT=666</dat></logurl>
                <dat>XN=KEY,CN=RTU FCT=4515</dat></logurl>
                <dat>XN=KEY,CN=RT</dat></logurl>";

char[] delimiters = { '<', ',', '&', ':', ' ', '\\', '\'' };

Regex regex = new Regex("fct=(.+)", RegexOptions.IgnoreCase);

var values = data.Split(delimiters).Select(x => regex.Match(x).Groups[1].Value);
values = values.Where(x => !string.IsNullOrWhiteSpace(x));

values.ToList().ForEach(Console.WriteLine);  

I hope my solution is helpful; let me know.

The code below is useful for extracting the repeated words in a text with LINQ:

string text = "Hi Naresh, How are you. You will be next Super man";
IEnumerable<string> strings = text.Split(' ').ToList();
var result = strings.AsEnumerable()
    .Select(x => new
    {
        str = Regex.Replace(x.ToLowerInvariant(), @"[^0-9a-zA-Z]+", ""),
        count = Regex.Matches(text.ToLowerInvariant(), @"\b" + Regex.Escape(Regex.Replace(x.ToLowerInvariant(), @"[^0-9a-zA-Z]+", "")) + @"\b").Count
    })
    .Where(x => x.count > 1).GroupBy(x => x.str).Select(x => x.First());
foreach (var item in result)
{
    Console.WriteLine(item.str + " = " + item.count.ToString());
}

You can split the line on "fct=" by passing a string[] separator:

char[] enddelimiter = { '<', ',', '&', ':', ' ', '\\', '\'' };
while ((lineRead = sr.ReadLine()) != null)
{
    string[] parts1 = lineRead.Split(new string[] { "fct=" }, StringSplitOptions.None);

    // parts1[0] is the text before the first "fct=", so start at index 1.
    for (int i = 1; i < parts1.Length; i++)
    {
        string _ar = parts1[i];
        int end = _ar.IndexOfAny(enddelimiter);

        // Cut the value at the first delimiter, or take the rest of the line if there is none.
        string value = end >= 0 ? _ar.Substring(0, end) : _ar;

        if (!string.IsNullOrEmpty(value))
        {
            MessageBox.Show(value);
        }
    }
}

As always, break the problem down into smaller bits. See if the following methods help in any way. Tying them into your code is left as an exercise.

private const string Prefix = "fct=";

//make delimiter look up fast
private static HashSet<char> endDelimiters = 
    new HashSet<char>(new [] { '<', ',', '&', ':', ' ', '\\', '\'' });

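// string.Split has no overload taking only a string[], so the options argument is required.
private static string[] GetAllFctFields(string line) =>
    line.Split(new string[] { Prefix }, StringSplitOptions.None);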

private static bool TryGetValue(string delimitedString, out string value)
{
    var buffer = new StringBuilder(delimitedString.Length);

    foreach (var c in delimitedString)
    {
        if (endDelimiters.Contains(c)) 
            break;

        buffer.Append(c);
    }

    //I'm assuming that no end delimiter is a format error.
    //Modify according to requirements
    if (buffer.Length == delimitedString.Length) 
    {
        value = null;
        return false;
    }

    value = buffer.ToString();
    return true;
}
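
For what it's worth, here is one way the two pieces might be tied together. Treat it as a rough sketch rather than the intended solution: `FctFields` and `GetValues` are hypothetical names, the fragment before the first "fct=" is skipped because it is not a value, and it keeps the assumption above that a value with no end delimiter is a format error.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical wrapper combining the split step and the delimiter scan.
public static class FctFields
{
    private const string Prefix = "fct=";

    private static readonly HashSet<char> EndDelimiters =
        new HashSet<char>(new[] { '<', ',', '&', ':', ' ', '\\', '\'' });

    public static List<string> GetValues(string line)
    {
        var values = new List<string>();

        // Skip(1): the first fragment is whatever came before the first "fct=".
        foreach (string fragment in line.Split(new[] { Prefix }, StringSplitOptions.None).Skip(1))
        {
            // Keep characters up to (not including) the first end delimiter.
            string value = new string(fragment.TakeWhile(c => !EndDelimiters.Contains(c)).ToArray());

            // Assumption carried over: no end delimiter means a format error, so skip it.
            if (value.Length < fragment.Length)
                values.Add(value);
        }

        return values;
    }
}
```

If the last value on a line can legitimately run to the end of the line, drop the length check and add every `value` instead.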

Something like:

class Program
{
    static void Main(string[] args)
    {
        char[] enddelimiter = { '<', ',', '&', ':', ' ', '\\', '\'' };
        var fct = "fct=";

        var lineRead = "fct=value1,useless text fct=vfct=alue2,fct=value3";

        var values = new List<string>();
        int start = lineRead.IndexOf(fct);
        while(start != -1)
        {
            start += fct.Length;
            int end = lineRead.IndexOfAny(enddelimiter, start);
            if (end == -1)
                end = lineRead.Length;
            string result = lineRead.Substring(start, end - start);
            values.Add(result);
            start = lineRead.IndexOf(fct, end);
        }
        values.ForEach(Console.WriteLine);
    }
}
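
If you would rather keep the list of indexes that the question's `GetPositions` method returns, each index can be turned into a value in the same way. The following is a sketch under a few assumptions: `PositionExtractor` and `ValuesAt` are hypothetical names, every position is the start index of the prefix in the line, and the prefix passed in is the full "fct=" (including the "=") that was searched for.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper that consumes a list of prefix positions.
public static class PositionExtractor
{
    private static readonly char[] EndDelimiters = { '<', ',', '&', ':', ' ', '\\', '\'' };

    // positions: start indexes of 'prefix' inside 'line',
    // e.g. the result of GetPositions(line, "fct=").
    public static List<string> ValuesAt(string line, IEnumerable<int> positions, string prefix)
    {
        var values = new List<string>();
        foreach (int position in positions)
        {
            int start = position + prefix.Length;            // first character after the prefix
            int end = line.IndexOfAny(EndDelimiters, start); // first delimiter after the value
            if (end == -1)
                end = line.Length;                           // no delimiter: value runs to end of line
            values.Add(line.Substring(start, end - start));
        }
        return values;
    }
}
```

Note that the sample lines contain upper-case "FCT=" while the code searches for lower-case "fct="; `IndexOf` has an overload that takes `StringComparison.OrdinalIgnoreCase` if both spellings should be found.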
