
URL Pattern Matching (PHP)?

(Programming Language: PHP v5.3)

I am working on a website where I search specific websites using the Google and Bing search APIs.

The Project:

A user can select a website to search from a drop-down list. The website also has an admin panel. If the admin wants to add a new website to the drop-down list, he has to provide two sample URLs from that site, as shown below.

(Image: the admin form)

When the form is submitted, our code goes through the input and generates a regex that we later use for pattern matching. The regex is stored in the database for later use.
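The question does not show the generator itself, so the following is only a minimal sketch of one way such a step could work, not the code actually used. `buildUrlPattern` is a hypothetical helper. It assumes both sample URLs include a scheme (otherwise `parse_url()` does not report a host), share the same host, and have the same number of path segments. Segments that are identical in both samples are kept literally; segments that differ become wildcards.

```php
<?php
// Hypothetical helper: derive one regex from two sample URLs of the same site.
// Assumption: both samples have a scheme, the same host, and the same path depth.
function buildUrlPattern($sampleA, $sampleB)
{
    $a = parse_url($sampleA);
    $b = parse_url($sampleB);
    if (!isset($a['host'], $b['host']) || strcasecmp($a['host'], $b['host']) !== 0) {
        return null; // samples must come from the same site
    }

    $segA = explode('/', trim(isset($a['path']) ? $a['path'] : '', '/'));
    $segB = explode('/', trim(isset($b['path']) ? $b['path'] : '', '/'));
    if (count($segA) !== count($segB)) {
        return null; // different depth: this sketch gives up
    }

    $parts = array();
    foreach ($segA as $i => $segment) {
        if ($segment === $segB[$i]) {
            $parts[] = preg_quote($segment, '#'); // fixed segment, e.g. "articles"
        } elseif (ctype_digit($segment) && ctype_digit($segB[$i])) {
            $parts[] = '\d+';                      // numeric in both samples, e.g. an id
        } else {
            $parts[] = '[^/]+';                    // anything else, e.g. a title slug
        }
    }

    return '#^https?://' . preg_quote($a['host'], '#') . '/'
        . implode('/', $parts) . '/?$#i';
}
```

With the sample URLs `http://www.example.com/articles/854/abcd` and `http://www.example.com/articles/17/wxyz`, this sketch yields a pattern that accepts other `/articles/<number>/<slug>` URLs on that host and rejects `/search/abcd`. It will still mis-generalize for sites whose two samples happen to share a variable segment, which is the limitation the question itself points out.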

In a separate form, the visiting user selects a website from the drop-down list and then enters the search query in a text box. We fetch results as JSON using the search APIs mentioned above, with the following query syntax as the search string:

"site:website query"
(where "website" is replaced with the website the user chose to search and "query" with the user's search query).
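As a rough illustration of the query syntax above, the search string could be assembled like this. `buildSiteQuery` is a hypothetical helper, and the parameter name `q` is an assumption; the actual parameter names depend on each search API.

```php
<?php
// Hypothetical helper: build the "site:website query" search string.
function buildSiteQuery($website, $query)
{
    return 'site:' . trim($website) . ' ' . trim($query);
}

$searchString = buildSiteQuery('www.example.com', 'abcd');

// URL-encode the string before putting it into an API request.
// The parameter name "q" is an assumption; check each API's documentation.
$encoded = http_build_query(array('q' => $searchString));
```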

The Problem

What we have to do now is get the best-matching URL. The reason for doing a pattern match is that search results sometimes contain unwanted links. For example, let's say I search the website "www.example.com" for an article named "abcd". The search engines might return these two URLs:

1) www.example.com/articles/854/abcd
2) www.example.com/search/abcd

The first URL is the one that I want. Now I have two issues to resolve.
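To make the filtering step concrete, here is a small sketch using the two example URLs. The pattern is hypothetical (in the project it would be the one loaded from the database), and the array stands in for URLs extracted from the decoded JSON results.

```php
<?php
// Hypothetical pattern for article pages on www.example.com; in the real
// project this would be the regex stored in the database.
$pattern = '#^(?:https?://)?www\.example\.com/articles/\d+/[^/]+/?$#i';

// The two example URLs, standing in for decoded API results.
$resultUrls = array(
    'www.example.com/articles/854/abcd',
    'www.example.com/search/abcd',
);

// Keep only the URLs that match the stored pattern.
// preg_grep() preserves keys, so array_values() reindexes the result.
$wanted = array_values(preg_grep($pattern, $resultUrls));
```

Here `$wanted` contains only the `/articles/854/abcd` URL.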

1) I know that the code I wrote to build a regex pattern from the sample URLs is never going to be perfect, considering that the admin adds websites on a regular basis. There can never be enough conditions in one piece of code to create a correct pattern for every website. Is there a better way to do this, or is regex my only option?

2) I am developing on a machine running Windows 7, where preg_match_all() returns results. But when I move the code to the server, which runs Linux, preg_match_all() does not return any results for the same parameters. I can't work out why. Does anyone know why this is happening?

I have been working with web technologies for only the past few weeks, so I don't know whether I have better options than regex. I would be very grateful if you could help me or point me to resources where I can find a solution to my problems.

About question 1: I can't quite grasp what you're trying to accomplish, so I can't give a valid opinion.

Regarding question 2: If both machines are running the same version of PHP, the regex library used ought to be the same. You can test this, however, by running the regex against a mock static file or string on both machines and checking whether the results are the same.
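A minimal sketch of such a test, to be run unchanged on both the Windows machine and the Linux server. The pattern and subject string are made-up stand-ins for the real ones. Printing the PHP and PCRE versions and the value of `preg_last_error()` helps tell apart a difference in the regex engine from a difference in the input data.

```php
<?php
// Fixed input, so the search APIs are taken out of the picture.
// Pattern and subject are hypothetical stand-ins for the real ones.
$pattern = '#www\.example\.com/articles/\d+/[^/\s]+#i';
$subject = "www.example.com/articles/854/abcd\nwww.example.com/search/abcd\n";

$count = preg_match_all($pattern, $subject, $matches);

// Compare these lines between the two machines.
echo 'PHP ', PHP_VERSION, ', PCRE ', PCRE_VERSION, "\n";
echo 'match count: ', var_export($count, true), "\n";

// Anything other than PREG_NO_ERROR (0) points at the regex engine,
// for example a backtrack limit or invalid UTF-8 in the subject.
echo 'preg_last_error: ', preg_last_error(), "\n";
print_r($matches[0]);
```

If both machines print the same match count for this fixed string, the regex engine is probably not the cause, and the input being matched is the next thing to compare.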

Since you're grabbing results from the search engines and then parsing them, the data retrieved might not be the same. Google and Bing may change part of the data depending on the OS you use, and that might alter the preg results.
