
How do I remove a word before a dash using JavaScript?

I need to remove all words before the dash at the beginning of each sentence. Some sentences do not have words before a dash, and dashes within a long sentence need to stay. Here is an example:

How do I change these strings:

PARIS — President Nicolas Sarkozy, running from behind for reelection...

GAZA CITY —Cross-border fighting between Gaza and Israel...

CARURU, Colombia — Quite suddenly, the endless green of Amazonian forest...

A year after an earthquake and tsunami devastated Japan's northeastern coast...

Into these strings:

President Nicolas Sarkozy, running from behind for reelection...

Cross-border fighting between Gaza and Israel...

Quite suddenly, the endless green of Amazonian forest...

A year after an earthquake and tsunami devastated Japan's northeastern coast...

How can I accomplish this with JavaScript (or PHP if JavaScript doesn't allow it)?

This is a pretty straightforward regex problem, but geez, it's not as straightforward as all the other answers assume. A few points:

  • Regex is the right choice. The split and substr answers won't deal with the leading space, and can't distinguish between a dash that follows a dateline at the beginning of a sentence and a dash in the middle of your text content. Any option you use ought to be able to deal with content like "President Nicolas Sarkozy — running from behind for reelection — came to Paris today..." as well as the examples in your question.

  • It's tricky to automatically recognize that my test sentence above doesn't have a dateline. Almost all the answers so far use a single description: "any number of arbitrary characters, followed by a dash". That's insufficient for a test sentence like the one above.

  • You'll get better results by adding a few more rules, like "fewer than X characters, located at the beginning of the string, followed by a dash, optionally followed by an arbitrary number of spaces, followed by a capital letter". Even this won't work correctly with "President Sarkozy — Carla Bruni's husband...", but you're going to have to assume that this edge case is sufficiently rare to ignore.

All of which gives you a function like this:

function removeDateline(str) {
    return str.replace(/^[^—]{3,75}—\s*(?=[A-Z])/, "");
}

Breaking it down:

  • ^ - must occur at the beginning of the string.
  • [^—]{3,75} - between 3 and 75 characters other than a dash.
  • — - the dash that ends the dateline.
  • \s* - optional whitespace.
  • (?=[A-Z]) - lookahead: the next character must be a capital letter.
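
The "fewer than X characters" rule can be made adjustable by assembling the same pattern from its parts. This is only a sketch: `makeDatelineRemover` and `removeShortDateline` are hypothetical names, not part of the answer's code, and the caps of 20 and 10 are arbitrary. Note that inside a string literal the backslash has to be doubled, so the whitespace piece is written `"\\s*"`.

```javascript
// Hypothetical helper: builds a dateline remover with a configurable length cap.
function makeDatelineRemover(maxLen) {
    const pattern = new RegExp(
        "^" +                       // anchor at the start of the string
        "[^—]{3," + maxLen + "}" +  // 3 to maxLen characters that are not an em dash
        "—" +                       // the em dash that ends the dateline
        "\\s*" +                    // optional whitespace after the dash
        "(?=[A-Z])"                 // lookahead: next character is a capital letter
    );
    return function (str) {
        return str.replace(pattern, "");
    };
}

const removeShortDateline = makeDatelineRemover(20);
console.log(removeShortDateline("PARIS — President Nicolas Sarkozy, running from behind for reelection..."));
// "President Nicolas Sarkozy, running from behind for reelection..."

// With a cap of 10, the 17-character "CARURU, Colombia " prefix is too long to count as a dateline.
console.log(makeDatelineRemover(10)("CARURU, Colombia — Quite suddenly, the endless green of Amazonian forest..."));
// unchanged
```

A tighter cap reduces false positives on ordinary sentences that happen to contain a dash early on, at the cost of missing long datelines.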

Usage:

var s = "PARIS — President Nicolas Sarkozy, running from behind for reelection...";
removeDateline(s); // "President Nicolas Sarkozy, running from behind for reelection..."

s = "PARIS — President Nicolas Sarkozy — running from behind for reelection...";
removeDateline(s);  // "President Nicolas Sarkozy — running from behind for reelection..."

s = "CARURU, Colombia — Quite suddenly, the endless green of Amazonian forest...";
removeDateline(s); // "Quite suddenly, the endless green of Amazonian forest..."
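
As a rough check, the function can be run over all four strings from the question plus the two tricky sentences mentioned above. The `samples` array is just an illustrative name, and the function is repeated so the snippet runs on its own.

```javascript
// Same function as above, repeated so this snippet is self-contained.
function removeDateline(str) {
    return str.replace(/^[^—]{3,75}—\s*(?=[A-Z])/, "");
}

const samples = [
    "PARIS — President Nicolas Sarkozy, running from behind for reelection...",
    "GAZA CITY —Cross-border fighting between Gaza and Israel...",
    "CARURU, Colombia — Quite suddenly, the endless green of Amazonian forest...",
    // No dateline at all: returned unchanged.
    "A year after an earthquake and tsunami devastated Japan's northeastern coast...",
    // No dateline, dashes in the body: returned unchanged, because the
    // character after the first dash is lowercase.
    "President Nicolas Sarkozy — running from behind for reelection — came to Paris today...",
    // Known false positive: a capital letter follows the dash, so the
    // start of the sentence is stripped as if it were a dateline.
    "President Sarkozy — Carla Bruni's husband..."
];

for (const s of samples) {
    console.log(removeDateline(s));
}
```

The first three lines come back without their datelines, the next two come back untouched, and the last one is shortened to "Carla Bruni's husband...", which is the edge case the answer says to accept.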

If each sentence can be separated from the others, you can use a regexp, like this example:

var s = "PARIS — President Nicolas Sarkozy, running from behind for reelection...";
function removeWord(str)
{
    return str.replace(/^[^—]+—\s*/, "");
}
alert(removeWord(s));
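
One thing to watch with this simpler pattern: it treats everything before the first dash as a dateline, so it may also cut into a sentence that has no dateline but uses dashes in the body. A small comparison, where `removeDatelineStrict` is a hypothetical name for the stricter pattern from the answer above:

```javascript
// Looser pattern: anything up to the first em dash is removed.
function removeWord(str) {
    return str.replace(/^[^—]+—\s*/, "");
}

// Stricter pattern: length cap plus a capital letter after the dash.
function removeDatelineStrict(str) {
    return str.replace(/^[^—]{3,75}—\s*(?=[A-Z])/, "");
}

const sentence = "President Nicolas Sarkozy — running from behind for reelection — came to Paris today...";

console.log(removeWord(sentence));
// "running from behind for reelection — came to Paris today..."

console.log(removeDatelineStrict(sentence));
// unchanged
```

If the input is known to always start with a dateline, the looser version is fine; otherwise the extra rules are probably worth having.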

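PHP

$x = "PARIS — President Nicolas Sarkozy, running from behind for reelection...";
$pos = strpos($x, "—");
// Skip past the dash (multi-byte in UTF-8) and trim the space after it; keep $x as-is if there is no dash.
$var = ($pos === false) ? $x : ltrim(substr($x, $pos + strlen("—")));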

In the most basic example:

var str = "PARIS - President Nicolas Sarkozy, running from behind for reelection.";
alert(str.split('-')[1].trim()); // outputs: President Nicolas Sarkozy, running from behind for reelection.

Depending on your actual document structure, there may be ways to loop through the content to speed this type of operation up.
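
A caveat with the split approach: taking element [1] drops everything after a second dash, and gives undefined when the string has no dash at all. A minimal sketch of a variant that avoids both problems, assuming the em dash from the question rather than a hyphen (`afterFirstDash` and `headline` are hypothetical names):

```javascript
// Hypothetical helper: drop everything up to and including the first em dash.
function afterFirstDash(str) {
    const i = str.indexOf("—");
    if (i === -1) {
        return str; // no dash: leave the string alone
    }
    return str.slice(i + 1).trimStart(); // keep the rest, including later dashes
}

const headline = "PARIS — President Nicolas Sarkozy — running from behind for reelection...";

console.log(headline.split("—")[1]);
// " President Nicolas Sarkozy " (the text after the second dash is lost)

console.log(afterFirstDash(headline));
// "President Nicolas Sarkozy — running from behind for reelection..."

console.log(afterFirstDash("A year after an earthquake and tsunami devastated Japan's northeastern coast..."));
// unchanged
```

Like split, this still cannot tell a dateline from a dash in the middle of a sentence, so it only suits input where every dash-bearing string really starts with a dateline.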
