Parse search string

I have search strings similar to the one below:

energy food "olympics 2010" Terrorism OR "government" OR cups NOT transport

and I need to parse it with PHP5 to detect which of the following clusters each part of the content belongs to:

  • AllWords array
  • AnyWords array
  • NotWords array

These are the rules I have set:

  1. If a word or quoted phrase has OR before or after it, it belongs to AnyWords.
  2. If a word or quoted phrase has NOT before it, it belongs to NotWords.
  3. If a word or quoted phrase has only 0 or more spaces before it, it belongs to AllWords.

So the end result should be something similar to:

AllWords: (energy, food, "olympics 2010")
AnyWords: (terrorism, "government", cups)
NotWords: (Transport)

What would be a good way to do this?

If you want to do this with regex, be aware that your parsing will break on stupid user input (the user, not the input =) ).

I'd try the following regexes.

NotWords:

(?<=NOT\s)\b((?!NOT|OR)\w+|"[^"]+")\b

AllWords:

(?<!OR\s)\b((?!NOT|OR)\w+|"[^"]+")\b(?!\s+OR)

AnyWords: Well.. the rest. =) They are not that easy to spot, since I do not know how to put "OR behind it or OR in front of it" into a single regex. Maybe you could join the results from these three regexes:

(?<=OR\s)\b((?!NOT|OR)\w+|"[^"]+")\b(?!\s+OR)
(?<=OR\s)\b((?!NOT|OR)\w+|"[^"]+")\b(?=\s+OR)
(?<!OR\s)\b((?!NOT|OR)\w+|"[^"]+")\b(?=\s+OR)

Problems: These require exactly one space between modifier words and expressions. PHP only supports lookbehinds for fixed-length expressions, so I see no way around that, sorry. You could just use ("[^"]+"|\w+) to split the input into tokens, and parse the resulting array manually. (Note there is no \b around the quoted alternative: a word boundary never matches between a space and a double quote, so a quoted phrase would otherwise be broken up into its individual words.)
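The "split the input, then parse the resulting array manually" idea can be sketched roughly as follows. This is only an illustration under a few assumptions: the function name `classifySearchTerms` and the `all`/`any`/`not` keys are made up for the example, operators are matched case-sensitively as `OR` and `NOT`, and a term preceded by `NOT` is treated as a NotWord even if an `OR` follows it. Because tokens are compared rather than matched with lookbehinds, any amount of whitespace between terms is fine.

```php
<?php
/**
 * Sort the terms of a search string into AllWords / AnyWords / NotWords.
 * Hypothetical helper: the name and the return shape are assumptions.
 */
function classifySearchTerms($query)
{
    // A token is either a quoted phrase or a run of word characters.
    preg_match_all('/"[^"]+"|\w+/', $query, $matches);
    $tokens = $matches[0];

    $result = array('all' => array(), 'any' => array(), 'not' => array());
    $count  = count($tokens);

    for ($i = 0; $i < $count; $i++) {
        $token = $tokens[$i];
        if ($token === 'OR' || $token === 'NOT') {
            continue; // operators are never search terms themselves
        }

        // Look at the neighbouring tokens to apply the three rules.
        $prev = $i > 0 ? $tokens[$i - 1] : null;
        $next = $i + 1 < $count ? $tokens[$i + 1] : null;

        if ($prev === 'NOT') {
            $result['not'][] = $token;          // rule 2: NOT before
        } elseif ($prev === 'OR' || $next === 'OR') {
            $result['any'][] = $token;          // rule 1: OR before or after
        } else {
            $result['all'][] = $token;          // rule 3: everything else
        }
    }

    return $result;
}

$parsed = classifySearchTerms(
    'energy food "olympics 2010" Terrorism OR "government" OR cups NOT transport'
);
print_r($parsed);
```

For the sample string this puts energy, food and "olympics 2010" into `all`, Terrorism, "government" and cups into `any`, and transport into `not`. Quoted phrases keep their quotes and terms keep the casing they were typed in; strip or normalise them afterwards if you need to.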

This is an excellent example of how a test-first approach can help you arrive at a solution. It might not be the very best one, but having tests written allows you to refactor with confidence and instantly see if you break any of the existing tests. Anyway, you could set up a few tests like:

public function setUp () {
  $this->searchParser = new App_Search_Parser();
}

public function testSingleWordParsesToAllWords () {
  $this->searchParser->parse('Transport');
  // assertEquals() takes the expected value first, then the actual one.
  $this->assertEquals(
     array('Transport'),
     $this->searchParser->getAllWords()
  );
  $this->assertEquals(array(), $this->searchParser->getNotWords());
  $this->assertEquals(array(), $this->searchParser->getAnyWords());
}

public function testParseOfCombinedSearchString () {
  $query = 'energy food "olympics 2010" Terrorism ' .
           'OR "government" OR cups NOT transport';
  $this->searchParser->parse($query);

  // Expected terms keep the casing used in $query.
  $this->assertEquals(
     array('energy', 'food', 'olympics 2010'),
     $this->searchParser->getAllWords()
  );
  $this->assertEquals(
     array('transport'),
     $this->searchParser->getNotWords()
  );
  $this->assertEquals(
     array('Terrorism', 'government', 'cups'),
     $this->searchParser->getAnyWords()
  );
}

Other good tests would include:

  • testParseTwoWords
  • testParseTwoWordsWithOr
  • testParseSimpleWithNot
  • testParseInvalid
    • Here you have to decide what invalid input looks like and how you interpret it, e.g.:
    • 'NOT Transport': Search for anything that doesn't contain Transport, or inform the user that they have to include at least one search term too?
    • 'OR energy': Is it OK to begin with a combinator?
    • 'food OR NOT energy': Does this mean "search for food or anything that doesn't contain energy", or does it mean "search for food and not energy", or doesn't it mean anything (i.e. throw an exception, return false or whatnot)?
  • testParseEmpty

Then, write the tests one by one, and write a simple solution that passes each test. Then refactor and make it right, and run again to see that you still pass the test. Once a test passes and the code is refactored, write the next test and repeat the procedure. Add more tests as you find special cases and refactor the code so that it passes all tests. If you break a test, back up and rewrite the code (not the test!) such that it passes.

As for how you can solve this problem, look into preg_match or strtok, or simply loop through the string, collecting tokens as you go.
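To make the "loop through the string" option concrete, here is a minimal sketch of a parser exposing the parse(), getAllWords(), getNotWords() and getAnyWords() methods that the tests above call. Everything inside the class is an assumption made for illustration: the private helpers tokenize(), flush() and operatorAt() are hypothetical names, quoted phrases are returned without their quotes, terms keep the casing they were typed in, and NOT wins when a term has both NOT before it and OR after it. How to treat the invalid inputs listed earlier is deliberately left open.

```php
<?php
// Hypothetical sketch; only the four public method names come from the tests.
class App_Search_Parser
{
    private $allWords = array();
    private $anyWords = array();
    private $notWords = array();

    public function parse($query)
    {
        $this->allWords = $this->anyWords = $this->notWords = array();
        $tokens = $this->tokenize($query);

        foreach ($tokens as $i => $token) {
            if ($this->operatorAt($tokens, $i) !== null) {
                continue; // skip the OR / NOT operators themselves
            }
            $prev = $this->operatorAt($tokens, $i - 1);
            $next = $this->operatorAt($tokens, $i + 1);

            if ($prev === 'NOT') {
                $this->notWords[] = $token['text'];
            } elseif ($prev === 'OR' || $next === 'OR') {
                $this->anyWords[] = $token['text'];
            } else {
                $this->allWords[] = $token['text'];
            }
        }
    }

    public function getAllWords() { return $this->allWords; }
    public function getAnyWords() { return $this->anyWords; }
    public function getNotWords() { return $this->notWords; }

    // Returns 'OR' or 'NOT' if the token at $i is an unquoted operator, else null.
    private function operatorAt(array $tokens, $i)
    {
        if (!isset($tokens[$i]) || $tokens[$i]['quoted']) {
            return null;
        }
        $text = $tokens[$i]['text'];
        return ($text === 'OR' || $text === 'NOT') ? $text : null;
    }

    // Walks the string one character at a time, collecting tokens.
    private function tokenize($query)
    {
        $tokens   = array();
        $buffer   = '';
        $inQuotes = false;
        $length   = strlen($query);

        for ($i = 0; $i < $length; $i++) {
            $char = $query[$i];
            if ($char === '"') {
                // A quote closes the current token and toggles phrase mode.
                $this->flush($tokens, $buffer, $inQuotes);
                $inQuotes = !$inQuotes;
            } elseif (!$inQuotes && trim($char) === '') {
                // Whitespace outside quotes separates tokens.
                $this->flush($tokens, $buffer, false);
            } else {
                $buffer .= $char;
            }
        }
        $this->flush($tokens, $buffer, $inQuotes);

        return $tokens;
    }

    // Appends the buffered text as a token (if any) and resets the buffer.
    private function flush(array &$tokens, &$buffer, $quoted)
    {
        if ($buffer !== '') {
            $tokens[] = array('text' => $buffer, 'quoted' => $quoted);
            $buffer = '';
        }
    }
}
```

Tracking whether a token was quoted means a phrase such as "OR" in quotes is treated as a search term rather than an operator. An unterminated quote simply runs to the end of the string; whether that should instead count as invalid input is one of the decisions a testParseInvalid test would pin down.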
