简体   繁体   中英

How to extract phrases and then words in a string of text?

I have a search method that takes in a user-entered string, splits it at each space character and then proceeds to find matches based on the list of separated terms:

string[] terms = searchTerms.ToLower().Trim().Split( ' ' );

Now I have been given a further requirement: to be able to search for phrases via double quote delimiters a la Google. So if the search terms provided were:

"a line of" text

The search would match occurrences of "a line of" and "text" rather than the four separate terms [the open and closing double quotes would also need to be removed before searching].

How can I achieve this in C#? I would assume regular expressions would be the way to go, but haven't dabbled in them much so don't know if they are the best solution.

If you need any more info, please ask. Thanks in advance for the help.

Here's a regex pattern that would return matches in groups named ' term ':

("(?<term>[^"]+)"\s*|(?<term>[^ ]+)\s*)+

So for the input:

"a line" of text

The output items identified by the ' term ' group would be:

a line
of
text

Regular expressions would definitely be the way to go...

You should check this MSDN link out for some info on the Regex class: http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx

and here is an excellent link to learn some regular expression syntax: http://www.radsoftware.com.au/articles/regexlearnsyntax.aspx

Then to add some code examples, you could be doing it something along these lines:

string searchString = "a line of";

Match m = Regex.Match(textToSearch, searchString);

or if you just want to find out if the string contains a match or not:

bool success = Regex.Match(textToSearch, searchString).Success;

use the regular expression builder here

http://gskinner.com/RegExr/

and you will be able to manipulate the regular expression to how you need it displayed

Use Regexs....

string textToSearchIn = ""a line of" text";
string result = Regex.Match(textToSearchIn, "(?<=").*?(?=")").Value;

or if more then one, put this into a match collection...

MatchCollection allPhrases = Regex.Matches(textToSearchIn, "(?<=").*?(?=")");

The Knuth-Morris-Pratt (KMP algorithm)is recognised as the fastest algorithm for finding substrings in strings (well, technically not strings but byte-arrays).

using System.Collections.Generic;

namespace KMPSearch
{
    public class KMPSearch
    {
        public static int NORESULT = -1;

        private string _needle;
        private string _haystack;
        private int[] _jumpTable;

        public KMPSearch(string haystack, string needle)
        {
            Haystack = haystack;
            Needle = needle;
        }

        public void ComputeJumpTable()
        {
            //Fix if we are looking for just one character...
            if (Needle.Length == 1)
            {
                JumpTable = new int[1] { -1 };
            }
            else
            {
                int needleLength = Needle.Length;
                int i = 2;
                int k = 0;

                JumpTable = new int[needleLength];
                JumpTable[0] = -1;
                JumpTable[1] = 0;

                while (i <= needleLength)
                {
                    if (i == needleLength)
                    {
                        JumpTable[needleLength - 1] = k;
                    }
                    else if (Needle[k] == Needle[i])
                    {
                        k++;
                        JumpTable[i] = k;
                    }
                    else if (k > 0)
                    {
                        JumpTable[i - 1] = k;
                        k = 0;
                    }

                    i++;
                }
            }
        }

        public int[] MatchAll()
        {
            List<int> matches = new List<int>();
            int offset = 0;
            int needleLength = Needle.Length;
            int m = Match(offset);

            while (m != NORESULT)
            {
                matches.Add(m);
                offset = m + needleLength;
                m = Match(offset);
            }

            return matches.ToArray();
        }

        public int Match()
        {
            return Match(0);
        }

        public int Match(int offset)
        {
            ComputeJumpTable();

            int haystackLength = Haystack.Length;
            int needleLength = Needle.Length;

            if ((offset >= haystackLength) || (needleLength > ( haystackLength - offset))) 
                return NORESULT;

            int haystackIndex = offset;
            int needleIndex = 0;

            while (haystackIndex < haystackLength)
            {
                if (needleIndex >= needleLength)
                    return haystackIndex;

                if (haystackIndex + needleIndex >= haystackLength)
                    return NORESULT;

                if (Haystack[haystackIndex + needleIndex] == Needle[needleIndex])
                {
                    needleIndex++;
                } 
                    else
                {
                    //Naive solution
                    haystackIndex += needleIndex;

                    //Go back
                    if (needleIndex > 1)
                    {
                        //Index of the last matching character is needleIndex - 1!
                        haystackIndex -= JumpTable[needleIndex - 1];
                        needleIndex = JumpTable[needleIndex - 1];
                    }
                    else
                        haystackIndex -= JumpTable[needleIndex];


                }
            }

            return NORESULT;
        }

        public string Needle
        {
            get { return _needle; }
            set { _needle = value; }
        }

        public string Haystack
        {
            get { return _haystack; }
            set { _haystack = value; }
        }

        public int[] JumpTable
        {
            get { return _jumpTable; }
            set { _jumpTable = value; }
        }
    }
}

Usage :-

using System;
using System.Collections.Generic;
using System.Text;
using System.Reflection;
namespace KMPSearch
{
    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length != 2)
            {
                Console.WriteLine("Usage: " + Environment.GetCommandLineArgs()[0] + " haystack needle");
            }
            else
            {
                KMPSearch search = new KMPSearch(args[0], args[1]);
                int[] matches = search.MatchAll();
                foreach (int i in matches)
                    Console.WriteLine("Match found at position " + i+1);
            }
        }

    }
}

Try this, It'll return an array for text. ex: { "a line of" text "notepad" }:

string textToSearch = "\"a line of\" text \" notepad\"";

MatchCollection allPhrases = Regex.Matches(textToSearch, "(?<=\").*?(?=\")");

var RegArray = allPhrases.Cast<Match>().ToArray();

output: {"a line of","text"," notepad" }

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM