How to extract phrases and then words in a string of text?
I have a search method that takes a user-entered string, splits it at each space character, and then finds matches based on the list of separated terms:
string[] terms = searchTerms.ToLower().Trim().Split( ' ' );
Now I have been given a further requirement: to be able to search for phrases via double quote delimiters, a la Google. So if the search terms provided were:
"a line of" text
The search would match occurrences of "a line of" and "text" rather than the four separate terms (the opening and closing double quotes would also need to be removed before searching).
How can I achieve this in C#? I would assume regular expressions are the way to go, but I haven't dabbled in them much, so I don't know if they are the best solution.
If you need any more info, please ask. Thanks in advance for the help.
Here's a regex pattern that would return matches in groups named 'term':
("(?<term>[^"]+)"\s*|(?<term>[^ ]+)\s*)+
So for the input:
"a line" of text
The output items identified by the 'term' group would be:
a line
of
text
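As a rough sketch of how that pattern might be applied in C#: .NET lets the same group name appear in both alternatives, and each repetition of the outer group adds an entry to the group's Captures collection. The class and method names below (SearchTermParser, ExtractTerms) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SearchTermParser
{
    // Hypothetical helper: returns quoted phrases and bare words as separate terms.
    public static List<string> ExtractTerms(string searchTerms)
    {
        // Same pattern as above, written as a verbatim string ("" is an escaped quote).
        const string pattern = @"(""(?<term>[^""]+)""\s*|(?<term>[^ ]+)\s*)+";

        var terms = new List<string>();
        Match match = Regex.Match(searchTerms.Trim().ToLower(), pattern);
        if (match.Success)
        {
            // Each repetition of the outer group records one capture in "term".
            foreach (Capture capture in match.Groups["term"].Captures)
                terms.Add(capture.Value);
        }
        return terms;
    }

    public static void Main()
    {
        foreach (string term in ExtractTerms("\"a line\" of text"))
            Console.WriteLine(term);
    }
}
```

For the input shown above, this should print the same three terms, one per line.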
Regular expressions would definitely be the way to go...
You should check out this MSDN link for some info on the Regex class: http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx
And here is an excellent link for learning some regular expression syntax: http://www.radsoftware.com.au/articles/regexlearnsyntax.aspx
As a code example, you could do something along these lines:
string searchString = "a line of";
Match m = Regex.Match(textToSearch, searchString);
or if you just want to find out whether the string contains a match or not:
bool success = Regex.Match(textToSearch, searchString).Success;
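One caveat with passing a user-entered phrase straight to Regex.Match is that characters such as `.` or `(` are interpreted as regex syntax. A minimal sketch of a literal, case-insensitive phrase check, assuming that is the behaviour wanted, could use Regex.Escape; the PhraseSearch and ContainsPhrase names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhraseSearch
{
    // Hypothetical helper: true if textToSearch contains the phrase literally.
    public static bool ContainsPhrase(string textToSearch, string phrase)
    {
        // Escape regex metacharacters so the phrase is matched as plain text.
        string pattern = Regex.Escape(phrase);
        return Regex.IsMatch(textToSearch, pattern, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string text = "This is A Line Of text (with brackets).";
        Console.WriteLine(ContainsPhrase(text, "a line of"));
        Console.WriteLine(ContainsPhrase(text, "(with brackets)"));
        Console.WriteLine(ContainsPhrase(text, "missing phrase"));
    }
}
```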
Use the regular expression builder here:
http://gskinner.com/RegExr/
With it you can adjust the regular expression until it matches what you need.
Use regexes...
string textToSearchIn = "\"a line of\" text";
string result = Regex.Match(textToSearchIn, "(?<=\").*?(?=\")").Value;
or if there is more than one, put the matches into a match collection...
MatchCollection allPhrases = Regex.Matches(textToSearchIn, "(?<=\").*?(?=\")");
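Because the lookaround pattern does not consume the quote characters, text sitting between two quoted phrases can also be reported as a match. A possible alternative, sketched here under the assumption that phrases should be extracted first and the remaining words afterwards, is to match the quotes themselves and capture only the inside. PhraseThenWords and Parse are illustrative names, not from the answers above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PhraseThenWords
{
    // Hypothetical helper: quoted phrases first, then the remaining single words.
    public static List<string> Parse(string input)
    {
        // The quotes are part of the match; group 1 holds the phrase without them.
        const string phrasePattern = "\"([^\"]+)\"";

        var terms = new List<string>();
        foreach (Match m in Regex.Matches(input, phrasePattern))
            terms.Add(m.Groups[1].Value);

        // Remove the quoted phrases, then split what is left on spaces.
        string remainder = Regex.Replace(input, phrasePattern, " ");
        foreach (string word in remainder.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            terms.Add(word);

        return terms;
    }

    public static void Main()
    {
        foreach (string term in Parse("\"a line of\" text"))
            Console.WriteLine(term);
    }
}
```

Note that this returns all phrases before all words, so the original order of terms is not preserved.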
The Knuth-Morris-Pratt (KMP) algorithm is a well-known linear-time algorithm for finding substrings in strings (technically it works on any sequence of symbols, such as byte arrays, not only strings).
using System.Collections.Generic;

namespace KMPSearch
{
    public class KMPSearch
    {
        public static int NORESULT = -1;

        private string _needle;
        private string _haystack;
        private int[] _jumpTable;

        public KMPSearch(string haystack, string needle)
        {
            Haystack = haystack;
            Needle = needle;
        }

        // Builds the KMP prefix (failure) table: JumpTable[i] is the length of the
        // longest proper prefix of Needle[0..i] that is also a suffix of it.
        public void ComputeJumpTable()
        {
            int needleLength = Needle.Length;
            JumpTable = new int[needleLength];
            int k = 0;
            for (int i = 1; i < needleLength; i++)
            {
                // Fall back to shorter prefixes until the next character matches.
                while (k > 0 && Needle[i] != Needle[k])
                    k = JumpTable[k - 1];
                if (Needle[i] == Needle[k])
                    k++;
                JumpTable[i] = k;
            }
        }

        // Returns the start indexes of all non-overlapping matches.
        public int[] MatchAll()
        {
            List<int> matches = new List<int>();
            int offset = 0;
            int needleLength = Needle.Length;
            int m = Match(offset);
            while (m != NORESULT)
            {
                matches.Add(m);
                offset = m + needleLength;
                m = Match(offset);
            }
            return matches.ToArray();
        }

        public int Match()
        {
            return Match(0);
        }

        // Returns the index of the first match at or after offset, or NORESULT.
        public int Match(int offset)
        {
            int haystackLength = Haystack.Length;
            int needleLength = Needle.Length;
            // An empty needle or an out-of-range offset cannot produce a match.
            if (needleLength == 0 || offset < 0 || offset >= haystackLength
                || needleLength > (haystackLength - offset))
                return NORESULT;

            ComputeJumpTable();
            int needleIndex = 0;
            for (int haystackIndex = offset; haystackIndex < haystackLength; haystackIndex++)
            {
                // On a mismatch, reuse the part of the needle that still matches
                // instead of re-reading haystack characters.
                while (needleIndex > 0 && Haystack[haystackIndex] != Needle[needleIndex])
                    needleIndex = JumpTable[needleIndex - 1];
                if (Haystack[haystackIndex] == Needle[needleIndex])
                    needleIndex++;
                if (needleIndex == needleLength)
                    return haystackIndex - needleLength + 1;
            }
            return NORESULT;
        }

        public string Needle
        {
            get { return _needle; }
            set { _needle = value; }
        }

        public string Haystack
        {
            get { return _haystack; }
            set { _haystack = value; }
        }

        public int[] JumpTable
        {
            get { return _jumpTable; }
            set { _jumpTable = value; }
        }
    }
}
Usage:
using System;

namespace KMPSearch
{
    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length != 2)
            {
                Console.WriteLine("Usage: " + Environment.GetCommandLineArgs()[0] + " haystack needle");
            }
            else
            {
                KMPSearch search = new KMPSearch(args[0], args[1]);
                int[] matches = search.MatchAll();
                // Parenthesize i + 1 so it is added as a number rather than appended as text.
                foreach (int i in matches)
                    Console.WriteLine("Match found at position " + (i + 1));
            }
        }
    }
}
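For comparison, the same list of non-overlapping match positions can be produced with the built-in String.IndexOf, which may be enough for short search strings. This is only a baseline sketch for checking results; the IndexOfSearch and FindAll names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class IndexOfSearch
{
    // Hypothetical helper: zero-based start indexes of all non-overlapping matches.
    public static List<int> FindAll(string haystack, string needle)
    {
        var positions = new List<int>();
        if (string.IsNullOrEmpty(needle))
            return positions;

        int index = haystack.IndexOf(needle, 0, StringComparison.Ordinal);
        while (index >= 0)
        {
            positions.Add(index);
            // Continue searching just past the end of the previous match.
            index = haystack.IndexOf(needle, index + needle.Length, StringComparison.Ordinal);
        }
        return positions;
    }

    public static void Main()
    {
        foreach (int position in FindAll("a line of text, a line of code", "a line of"))
            Console.WriteLine("Match found at position " + (position + 1));
    }
}
```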
Try this; it'll return an array of matches for the text, e.g. for { "a line of" text " notepad" }:
string textToSearch = "\"a line of\" text \" notepad\"";
MatchCollection allPhrases = Regex.Matches(textToSearch, "(?<=\").*?(?=\")");
var RegArray = allPhrases.Cast<Match>().ToArray();
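output (the Value of each Match): { "a line of", " text ", " notepad" }. Note that the unquoted text between the phrases is also matched and keeps its surrounding spaces.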