Regex that searches for sentences that exclude one word
Ciao guys,
I'm creating a corpus in XML format composed of tweets that contain the keyword "catastrophic". Each tweet is embedded like this:
<tweet>"Catastrophic loss" at Tennessee's Zoo Knoxville as 33 reptiles are found dead </tweet>
<tweet>Overcoming Catastrophic Forgetting by Incremental Moment Matching, Lee et al.</tweet>
After trimming tons of unnecessary data, there are still 200+ tweets that don't contain the keyword at all. I'd like to delete them, so I tried a regex like this, but it didn't work:
<tweet>^.*(?!catastrophic).*$</tweet>
Does anybody have any idea?
I'm not sure what programming language or other toolset you are using.
But a fairly simple approach would be to rewrite the file (or whatever kind of input it is) through a filter that only writes the entries that do contain "catastrophic".
Assuming it is a file with one tweet per line (just to illustrate the idea):
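grep -iE '<tweet>.*catastrophic.*</tweet>' originalFile > newFile
The -i flag makes the match case-insensitive, so tweets that spell the keyword "Catastrophic" (as in the samples above) are kept as well. grep -E is the equivalent of the older egrep.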
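As for why the attempted pattern matches nothing: a negative lookahead such as (?!catastrophic) only checks the single position where it stands, so the surrounding .* can simply absorb the keyword, and a ^ placed after <tweet> can never be at the start of a line. If you would rather delete the unwanted tweets directly, here is a minimal Python sketch, assuming the corpus fits in memory and tweets are never nested. The sample data, the third tweet, and the names WITHOUT_KEYWORD and drop_tweets_without_keyword are made up for illustration.

```python
import re

# Hypothetical sample corpus; in practice this would be read from your file.
corpus = """<tweet>"Catastrophic loss" at Tennessee's Zoo Knoxville as 33 reptiles are found dead </tweet>
<tweet>Overcoming Catastrophic Forgetting by Incremental Moment Matching, Lee et al.</tweet>
<tweet>Nice weather in Knoxville today</tweet>
"""

# Matches one whole <tweet> element that does NOT contain the keyword.
# "(?:(?!catastrophic|</tweet>).)*" consumes one character at a time and,
# before each character, checks that neither the keyword nor the closing
# tag starts at that position. The match therefore cannot run past the end
# of the tweet, and it fails as soon as the keyword shows up.
WITHOUT_KEYWORD = re.compile(
    r"<tweet>(?:(?!catastrophic|</tweet>).)*</tweet>\n?",
    re.IGNORECASE | re.DOTALL,  # ignore case; let "." span line breaks
)


def drop_tweets_without_keyword(text):
    """Return text with every keyword-free <tweet> element removed."""
    return WITHOUT_KEYWORD.sub("", text)


cleaned = drop_tweets_without_keyword(corpus)
print(cleaned)
```

With this sample input, the two tweets containing "Catastrophic" survive and the third one is removed. Writing cleaned to a new file, rather than overwriting the original, keeps the raw data around in case the pattern needs adjusting.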