Extracting words from a text using PHP

Hello friends, I have a little problem. I need to extract only the words from any given text.

I tried to retrieve the words using strtok(), strstr(), and some regular expressions, but only managed to extract some of the words.

The problem is complex because of the number of characters and symbols that can accompany the words.

This is a sample text from which the words must be extracted:

Main article: our 46,000 required, !but (1947-2011) mail@server.com March 8, 2014 Gutenberg's 34-DE 'a' 3,1415 Us: @unknown n go http://google.com or www.google.com and http://www.google.com (r) The 509th "composite" and; C-54 #dog v4.0 ¿as is done? ¿article... agriculture? x ¿cat? now! Hi!! (87 meters).

Sample text, for testing.

The result of extracting the text should be:

Main article our required but March Gutenberg's a go or and The composite and dog as is done article agriculture cat now Hi meters

Sample text for testing

The first function I wrote to simplify the work:

function PreText($text){
  $text = str_replace("\n", ".", $text);
  $text = str_replace("\r", ".", $text);

  $text = str_replace("'", "", $text);
  $text = str_replace("?", "", $text);
  $text = str_replace("¿", "", $text);
  $text = str_replace("(", "", $text);
  $text = str_replace(")", "", $text);
  $text = str_replace('"', "", $text);
  $text = str_replace(';', "", $text);
  $text = str_replace('!', "", $text);
  $text = str_replace('<', "", $text);
  $text = str_replace('>', "", $text);
  $text = str_replace('#', "", $text);

  $text = str_replace(",", "", $text);

  $text = str_replace(".c", "", $text);
  $text = str_replace(".C", "", $text);
  return $text;
}

Split function:

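function SplitWords($text){
  $words = explode(" ", $text);
  $ContWords = count($words);
  $NewText = ""; // initialize before appending to avoid an undefined-variable warning

  for ($i = 0; $i < $ContWords; $i++){
    // keep only tokens made up entirely of letters
    if (ctype_alpha($words[$i])) {
      $NewText .= $words[$i].", ";
    }
  }
  return $NewText;
}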

The program:

<?php
  include_once ('functions.php');

  $text = "Main article: our 46,000 ...";
  $text = PreText($text);
  $text = SplitWords($text);
  echo $text;
?>

The code still has a long way to go. I would appreciate your help.

If I understand you correctly, you want to remove all non-letters from the string. I would use preg_replace:

$text = "Main article: our 46,000...";
$text = preg_replace("/[^a-zA-Z' ]/","",$text);

This should remove everything that is not a letter, an apostrophe, or a space.
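To see what that character-class filter does in practice, here is a small sketch on a shortened version of the sample text. The shortened input and the variable names $clean and $words are only illustrative. Note that characters are deleted in place, so an e-mail address collapses into a single letter-only token instead of disappearing.

```php
<?php
// Shortened sample input (an assumption for illustration, not the full text).
$text = "our 46,000 required, !but mail@server.com Hi!!";

// Same pattern as above: drop everything except letters, apostrophes and spaces.
$clean = preg_replace("/[^a-zA-Z' ]/", "", $text);

// Split on runs of spaces and discard empty pieces.
$words = preg_split('/ +/', $clean, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);
// The array holds: our, required, but, mailservercom, Hi
```

The number "46,000" is removed entirely, but "mail@server.com" survives as "mailservercom", so addresses and URLs would need to be handled before this step.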

Try this; it almost meets your requirement:

<?php
$text = <<<HEREDOC
Main article: our 46,000 required, !but (1947-2011) mail@server.com March 8, 2014 Gutenberg's 34-DE 'a' 3,1415 Us: @unknown n go http://google.com or www.google.com and
        http://www.google.com (r) The 509th "composite" and; C-54 #dog v4.0 ¿as is done? ¿article... agriculture? x ¿cat? now! Hi!! (87 meters). Sample text, for testing.
HEREDOC;
// Remove all kinds of URLs and e-mail addresses from the text
$url_email = "((https?|ftp)\:\/\/)?"; // SCHEME
$url_email .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass
$url_email .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP
$url_email .= "(\:[0-9]{2,5})?"; // Port
$url_email .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path
$url_email .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query
$url_email .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor

$text = preg_replace("/$url_email/","",$text);
// Remove anything like "Us: @unknown"
$text = preg_replace("/Us:.?@\\w+/","",$text);
// Remove every character that is not a letter, an apostrophe, or a space
$text = preg_replace("/[^a-zA-Z' ]/","",$text);
echo $text;
?>
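As an alternative to deleting characters from the whole string, one could work token by token: split on whitespace, strip the punctuation around each token, and keep the token only if what remains consists of letters (optionally with an inner apostrophe). The sketch below is not from either answer; the function name extractWords is hypothetical, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper, shown as a sketch only; assumes UTF-8 input.
function extractWords(string $text): array
{
    $words = [];
    // Split on any run of whitespace, dropping empty pieces.
    foreach (preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY) as $token) {
        // Strip leading and trailing punctuation, but keep "@" so that
        // handles such as "@unknown" are rejected by the check below.
        $core = preg_replace('/^[^\p{L}\p{N}@]+|[^\p{L}\p{N}@]+$/u', '', $token);
        // Accept letters only, optionally joined by apostrophes (Gutenberg's).
        if (preg_match('/^\p{L}+(?:\'\p{L}+)*$/u', $core)) {
            $words[] = $core;
        }
    }
    return $words;
}

echo implode(' ', extractWords("Main article: our 46,000 required, !but (1947-2011) mail@server.com #dog ¿cat? Hi!!"));
// prints: Main article our required but dog cat Hi
```

Because tokens containing digits, dots, or "@" are rejected as a whole, numbers, URLs, and e-mail addresses drop out without a dedicated URL pattern. This sketch does not reproduce the requested output exactly: single-letter tokens such as "n" and "x", and "Us" from "Us:", would still be kept and would need an extra rule.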
