
Regular expression that will extract sentences from a text file

I need a regular expression that will extract sentences from a text file. Example text:

Consider, for example, the Asian tsunami disaster that happened in the end of 2004. A query to Google News (http://news.google.com) returned more than 80,000 online news articles about this event within one month (Jan.17 through Feb.17, 2005). information by mr. Kahana.

Here's my code:

$re = '/(?<=[.!?]|[.!?][\'"])\s+/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

But the last sentence is still split into "information by mr." and "Kahana." How can I solve this? Thank you :)

You Can't Do This with Regular Expressions

English as a language does not fit into well-defined formatting rules. As such, regular expressions are not suited to what you are trying to do. What you are really looking for is something like a natural language processor.

Unless this is critical to your program, I suggest you instead determine the following things:

  • What is an acceptable level of error? Nothing you do will be perfect. But if it works 80% of the time, is that okay? 90%? 99%? How critical is this to you or your client?
  • Where is the text coming from? For example, a textbook will most likely be written differently than people's Twitter feeds. You can do research and make exceptions based on what you see in the actual text you are using.
  • What are you doing with the text? If you are just indexing things like keywords, then it doesn't matter (as much) whether you get the sentences split correctly. It's all about tuning the program to get the appropriate output for this specific purpose.
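To put a number on "acceptable level of error", one rough option is to hand-label a sample of your text and count how many of the expected sentences the splitter fails to reproduce exactly. The sketch below assumes that setup; the function name sentenceErrorRate and the sample text are made up for illustration, and the measure is deliberately crude (it ignores repeated sentences and partial matches).

```php
<?php
/**
 * Fraction of hand-labelled sentences that the splitter did not
 * reproduce exactly. Hypothetical helper, not part of PHP.
 */
function sentenceErrorRate(array $expected, array $actual): float
{
    if (count($expected) === 0) {
        return 0.0;
    }
    // Sentences we expected but that are absent from the splitter output.
    $missed = array_diff($expected, $actual);
    return count($missed) / count($expected);
}

// Made-up sample with its hand-labelled sentences.
$text = 'He left in 2004. Information by mr. Kahana. Was it right?';
$expected = [
    'He left in 2004.',
    'Information by mr. Kahana.',
    'Was it right?',
];

// The split pattern from the question.
$re = '/(?<=[.!?]|[.!?][\'"])\s+/';
$actual = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

printf("error rate: %.2f\n", sentenceErrorRate($expected, $actual));
// prints "error rate: 0.33" (the "mr." sentence is cut in two)
```

Run against a larger labelled sample from your real source, the same number tells you whether each new rule actually helps.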

My recommendation is to use trial and error to get your error rate down as much as possible. Run your program on a large set of text, and keep adding exceptions until you get an acceptable error rate. If, however, you need more than a couple dozen rules or so, you will probably just want to rethink the problem.
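As a minimal sketch of what "adding exceptions" could look like, the code below keeps the split pattern from the question and then re-joins any piece that ends in a known abbreviation. The ABBREVIATIONS list and the splitSentences name are assumptions for illustration; the list is nowhere near complete and would need to grow from what you see in your own text.

```php
<?php
// Hypothetical exception list; extend it based on your own data.
const ABBREVIATIONS = ['mr', 'mrs', 'ms', 'dr', 'prof'];

/**
 * Split text with the question's regex, then merge pieces that were
 * cut right after a known abbreviation.
 */
function splitSentences(string $text, array $abbreviations = ABBREVIATIONS): array
{
    // Same split pattern as in the question.
    $re = '/(?<=[.!?]|[.!?][\'"])\s+/';
    $pieces = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

    $sentences = [];
    $buffer = '';
    foreach ($pieces as $piece) {
        // Pieces are re-joined with a single space, so the original
        // whitespace at a merged split point is not preserved.
        $buffer = $buffer === '' ? $piece : $buffer . ' ' . $piece;

        // If the buffer ends in "<abbreviation>.", keep accumulating.
        if (preg_match('/(\w+)\.$/', $buffer, $m)
            && in_array(strtolower($m[1]), $abbreviations, true)) {
            continue;
        }
        $sentences[] = $buffer;
        $buffer = '';
    }
    if ($buffer !== '') {
        $sentences[] = $buffer; // text ended on an abbreviation
    }
    return $sentences;
}

print_r(splitSentences('Some information by mr. Kahana. It arrived on time.'));
// Two sentences: "Some information by mr. Kahana." and "It arrived on time."
```

This still gets cases wrong, for example a sentence that genuinely ends with an abbreviation, which is exactly the kind of residual error you have to decide whether you can live with.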

In short, PHP and regular expressions aren't meant for this because English is funky. So either live with adding exceptions to get a small(er) error rate, or rethink the problem altogether.
