
Extracting first two words in perl using regex

I want to extract the first two words from a sentence using a Perl function in PostgreSQL. In PostgreSQL, I can do this with:

text = "I am trying to make this work";

Select substring(text from '(^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)');

It would return "I am".

I tried to build a Perl function in PostgreSQL that does the same thing.

CREATE OR REPLACE FUNCTION extract_first_two (text)
    RETURNS text AS 
    $$
    my $my_text = $_[0];
    my $temp;

    $pattern = '^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)';
    my $regex = qr/$pattern/;
    if ($my_text=~ $regex) {
    $temp = $1;
    }

    return $temp;
    $$ LANGUAGE plperl;

But I receive a syntax error near the regular expression. I am not sure what I am doing wrong.

Extracting words is non-trivial even in English. Take the following contrived example using Locale::CLDR:

use Locale::CLDR;
my $locale = Locale::CLDR->new('en');
my @words = $locale->split_words('adf543. 123.25');

@words now contains

  • adf543
  • .
  • 123.25

Note that the full stop after adf543 is split off as a separate word, but the one between 123 and 25 is kept as part of the number 123.25, even though the '.' is the same character.

It gets worse when you look at non-English languages, and much worse when you use non-Latin scripts.

You need to define precisely what you think a word is; otherwise the following French gets split incorrectly.

Je avais dit «Elle a dit «Il a dit «Ni» il ya trois secondes»»

The parentheses are mismatched in your regex pattern. It has three opening parentheses and four closing ones.

Also, you have two single quotes in the middle of a single-quoted string, so

'^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)'

is parsed as two separate strings

'^\w+-\w+|^\w+(\s+)?(!|,|\&|'

and

')?(\s+)?\w+)'

But I can't suggest how to fix it, as I don't understand your intention.
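If the doubled quote was meant to be a literal apostrophe (an assumption: `''` is how SQL escapes a quote inside a string literal, but Perl does not treat it that way), the following is a minimal sketch of one way the pattern could be written. It uses `q{...}` so the apostrophe needs no escaping, wraps both alternatives in a single group so the parentheses balance and `$1` holds the whole match, and simplifies `(\s+)?` to `\s*`. The name `first_two_words` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first two words, or undef if there is no match.
sub first_two_words {
    my ($text) = @_;

    # q{...} is a single-quoted string with brace delimiters,
    # so a literal ' can appear inside without any escaping.
    my $pattern = q{^(\w+-\w+|\w+\s*[!,&']?\s*\w+)};
    my $regex   = qr/$pattern/;

    return $text =~ $regex ? $1 : undef;
}

print first_two_words("I am trying to make this work"), "\n";   # prints: I am
```

Inside a PL/Perl function the same quoting applies, because the body between `$$ ... $$` is handed to Perl as written and SQL quote-doubling is not needed there.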

Did you mean a double quote perhaps? In which case (!|,|\&|")? can be written as [!,&"]?
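As a quick check of that equivalence, the sketch below runs both forms against a handful of sample strings and confirms they agree. The helper names `matches_alternation` and `matches_class` and the sample inputs are made up for illustration.

```perl
use strict;
use warnings;

# Both patterns are anchored so that only the optional punctuation itself is tested.
my $alternation = qr/^(?:!|,|\&|")?$/;
my $char_class  = qr/^[!,&"]?$/;

# Hypothetical helpers returning 1 or 0 for each form.
sub matches_alternation { return $_[0] =~ $alternation ? 1 : 0 }
sub matches_class       { return $_[0] =~ $char_class  ? 1 : 0 }

for my $sample ('', '!', ',', '&', '"', ';', '!!') {
    my $alt = matches_alternation($sample);
    my $cls = matches_class($sample);
    die "forms disagree on [$sample]" unless $alt == $cls;
    print "[$sample] alternation=$alt class=$cls\n";
}
```

The character class is shorter, and `&` needs no backslash in either form.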


Update

At a rough guess, I think you want this:

my $regex = qr{ ^ \w++ \s* [-!,&"]* \s* \w+ }x;
$temp = $1 if $my_text=~ /($regex)/;

but I can't be sure. If you describe in English what you're looking for, then I can help you better. For instance, it's unclear why you don't have question marks, full stops, and semicolons in the list of intervening punctuation.
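To see how the guessed pattern behaves, here is a small standalone sketch that wraps it in a sub and runs it on a few inputs. The sub name mirrors the function in the question; the sample strings are made up. The body should carry over to a PL/Perl function more or less unchanged, reading the argument from `$_[0]`, though that has not been tested here.

```perl
use strict;
use warnings;

sub extract_first_two {
    my ($my_text) = @_;

    # /x lets the pattern be spaced out for readability.
    # \w++ is possessive: it never gives back characters, so a lone word
    # cannot be split in two to satisfy the trailing \w+.
    my $regex = qr{ ^ \w++ \s* [-!,&"]* \s* \w+ }x;

    return $my_text =~ /($regex)/ ? $1 : undef;
}

for my $sample ("I am trying to make this work", "well-known fact",
                "Hello, world! Again", "Hello") {
    my $result = extract_first_two($sample);
    print defined $result ? "'$result'\n" : "no match\n";
}
```

A single word such as "Hello" yields no match because of the possessive quantifier; with a plain `\w+` the engine would backtrack and return "Hello" as if it were two words.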
