Regular expression that will extract sentences from a text file
I need a regular expression that will extract sentences from a text file.

Example text:

Consider, for example, the Asian tsunami disaster that happened in the end of 2004. A query to Google News (http://news.google.com) returned more than 80,000 online news articles about this event within one month (Jan.17 through Feb.17, 2005). information by mr. Kahana.
Here's my code:
$re = '/(?<=[.!?]|[.!?][\'"])\s+/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
But the last sentence is still split in two:

information by mr.

and

Kahana.

How can I solve this? Thank you :)
You Can't Do This with Regular Expressions
English does not follow tidy formatting rules, so regular expressions are a poor fit for what you are trying to do. What you are really looking for is something like a natural language processor.
Unless this is critical to your program, I suggest you instead determine the following things:
My recommendation is to use trial and error to get your error rate down as much as possible. Run your program on a large set of text, and keep adding exceptions until you get an acceptable error rate. If, however, you need more than a couple dozen rules or so, you will probably want to rethink the problem.
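As a rough illustration of the "keep adding exceptions" approach, here is a minimal sketch. It reuses the pattern from the question and then re-joins pieces that were cut right after a known abbreviation. The function name `splitSentences` and the `ABBREVIATIONS` list are hypothetical, made up for this example; the list is only a starting point that you would grow as false splits show up in your own text.

```php
<?php
// Hypothetical exception list: abbreviations that end in a period but
// usually do not end a sentence. Extend it as false splits turn up.
const ABBREVIATIONS = ['mr', 'mrs', 'ms', 'dr', 'prof'];

/**
 * Split text into sentences, then re-join pieces that were cut right
 * after a known abbreviation. (Illustrative helper, not a library call.)
 */
function splitSentences(string $text, array $abbreviations = ABBREVIATIONS): array
{
    // Same pattern as in the question: break on whitespace that follows
    // '.', '!' or '?', optionally followed by a closing quote.
    $re = '/(?<=[.!?]|[.!?][\'"])\s+/';
    $pieces = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

    $sentences = [];
    $buffer = '';
    foreach ($pieces as $piece) {
        // Re-joining uses a single space, so original whitespace
        // between merged pieces is normalized.
        $buffer = ($buffer === '') ? $piece : $buffer . ' ' . $piece;

        // Look at the last word before the trailing period, if any.
        if (preg_match('/(\w+)\.$/', $buffer, $m)
            && in_array(strtolower($m[1]), $abbreviations, true)) {
            continue; // ends in an abbreviation: keep accumulating
        }
        $sentences[] = $buffer;
        $buffer = '';
    }
    if ($buffer !== '') {
        $sentences[] = $buffer; // the text itself ended on an abbreviation
    }
    return $sentences;
}
```

On the sample text from the question this should keep "information by mr. Kahana." together as one piece. It will still get things wrong, for example by merging two sentences when the first one genuinely ends in "Mr.", which is exactly the kind of error rate you have to decide whether you can live with.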
In short, PHP and regular expressions aren't meant for this because English is funky. So either live with adding exceptions to get a small(er) error rate, or rethink the point altogether.