In PHP, is there a way to detect if a string contains any words?

The question is about detecting whether a string has any words in it (from any language). I'm not looking for a specific word, I just want to test whether a string contains real, existing words.

Example: $str = 'allo' would return true and $str = 'zyzassk ' would return false.

I tried preg_match_all('/\w/', $input_lines, $output_array); but with \w, preg_match_all returns each individual letter. How can I get complete words? And is there a library to test against dictionaries?

Is there a way or a PHP function to do this?

There are several ways to do it:

Using str_contains: https://stackoverflow.com/a/65473395/4717133

Using strpos: https://www.php.net/manual/es/function.strpos.php

The main problem is that you need to match against each language's vocabulary to know whether a word exists in any language. This can be a bit tortuous and shouldn't be done by hand...

...but you can use a service like Google Translate, which can detect the language of a text:

https://cloud.google.com/translate/docs/basic/detecting-language

HTTP method and URL:

POST https://translation.googleapis.com/language/translate/v2/detect

JSON body of the request:

{
  "q": "Mi comida favorita es una enchilada."
}

To submit your request you can use curl (with the JSON body saved as request.json):

curl -X POST \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
https://translation.googleapis.com/language/translate/v2/detect

You would get a response like this:

{
  "data": {
    "detections": [
      [
        {
          "confidence": 1,
          "isReliable": false,
          "language": "es"
        }
      ]
    ]
  }
}
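
On the PHP side, a response like this can be decoded with json_decode. The following is only a minimal sketch, assuming the response body has exactly the shape shown above; the helper name detected_language and the 0.5 confidence threshold are made up for illustration. Keep in mind that a language detector may still guess a language for gibberish, so this is a rough signal at best.

```php
<?php
// Hypothetical helper: pull the top detected language out of a
// "detect" response body shaped like the example above.
// Returns null when the JSON is malformed or the confidence is too low.
function detected_language(string $responseJson, float $minConfidence = 0.5): ?string
{
    $decoded = json_decode($responseJson, true);

    // "detections" is a list (one entry per "q" value) of candidate lists.
    $first = $decoded['data']['detections'][0][0] ?? null;
    if (!is_array($first) || !isset($first['language'], $first['confidence'])) {
        return null;
    }

    return $first['confidence'] >= $minConfidence ? $first['language'] : null;
}

$response = '{"data":{"detections":[[{"confidence":1,"isReliable":false,"language":"es"}]]}}';
var_dump(detected_language($response)); // string(2) "es"
```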

I did some digging and came up with this:

function is_str_have_human_words ( $txtToDetect ) {
    
    $highestLangCode = ''; 
    
    if ( preg_match('/[\x{4E00}-\x{9FBF}]/u', $txtToDetect) )   { $highestLangCode = 'zh'; }
    if ( preg_match('/[\x{3040}-\x{309F}]/u', $txtToDetect) )   { $highestLangCode = 'ja'; } // Hiragana
    if ( preg_match('/[\x{30A0}-\x{30FF}]/u', $txtToDetect) )   { $highestLangCode = 'ja'; } // Katakana
    if ( preg_match('/[\x{3130}-\x{318F}\x{AC00}-\x{D7AF}]/u', $txtToDetect) ) { $highestLangCode = 'ko'; }
    if ( preg_match('/\p{Thai}/u', $txtToDetect) )              { $highestLangCode = 'th'; }
    if ( preg_match('/\p{Arabic}/u', $txtToDetect) )            { $highestLangCode = 'ar'; }
    if ( preg_match('/\p{Armenian}/u', $txtToDetect) )          { $highestLangCode = 'hy'; }
    if ( preg_match('/\p{Bengali}/u', $txtToDetect) )           { $highestLangCode = 'bn'; }
    if ( preg_match('/\p{Devanagari}/u', $txtToDetect) )        { $highestLangCode = 'hi'; }
    if ( preg_match('/\p{Georgian}/u', $txtToDetect) )          { $highestLangCode = 'ka'; }
    if ( preg_match('/\p{Greek}/u', $txtToDetect) )             { $highestLangCode = 'el'; }
    if ( preg_match('/\p{Gujarati}/u', $txtToDetect) )          { $highestLangCode = 'gu'; }
    if ( preg_match('/\p{Hebrew}/u', $txtToDetect) )            { $highestLangCode = 'he'; }
    if ( preg_match('/\p{Kannada}/u', $txtToDetect) )           { $highestLangCode = 'kn'; }
    if ( preg_match('/\p{Khmer}/u', $txtToDetect) )             { $highestLangCode = 'km'; }
    if ( preg_match('/\p{Lao}/u', $txtToDetect) )               { $highestLangCode = 'lo'; }
    if ( preg_match('/\p{Limbu}/u', $txtToDetect) )             { $highestLangCode = 'li'; }
    if ( preg_match('/\p{Malayalam}/u', $txtToDetect) )         { $highestLangCode = 'ml'; }
    if ( preg_match('/\p{Mongolian}/u', $txtToDetect) )         { $highestLangCode = 'mn'; }
    if ( preg_match('/\p{Myanmar}/u', $txtToDetect) )           { $highestLangCode = 'my'; }
    if ( preg_match('/\p{Oriya}/u', $txtToDetect) )             { $highestLangCode = 'or'; }
    if ( preg_match('/\p{Sinhala}/u', $txtToDetect) )           { $highestLangCode = 'si'; }
    if ( preg_match('/\p{Tagalog}/u', $txtToDetect) )           { $highestLangCode = 'tl'; }
    if ( preg_match('/\p{Tamil}/u', $txtToDetect) )             { $highestLangCode = 'ta'; }
    if ( preg_match('/\p{Telugu}/u', $txtToDetect) )            { $highestLangCode = 'te'; }
    if ( preg_match('/\p{Thaana}/u', $txtToDetect) )            { $highestLangCode = 'dv'; }
    if ( preg_match('/\p{Tibetan}/u', $txtToDetect) )           { $highestLangCode = 'bo'; }
    if ( preg_match('/[А-Яа-яЁё]/u', $txtToDetect) )            { $highestLangCode = 'ru'; }
        
    if ( $highestLangCode == '' ) {
        
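        // Assumes $mysqli is an open mysqli connection in the global scope.
        global $mysqli;
        $wordsToTests = explode(' ', strtolower($txtToDetect));
        $wordsToTests = preg_replace("/[[:punct:]]+/", "", $wordsToTests);
        // A prepared statement replaces the undefined clean() helper.
        $stmt = $mysqli->prepare("SELECT 1 FROM `langtable` WHERE `word` = ? LIMIT 1");
        
        foreach ( $wordsToTests as $wordsToTest ) {     
            
            // DATABASE WITH WORDS FROM LOTS OF LANGUAGES            
            $stmt->bind_param('s', $wordsToTest);
            $stmt->execute();
            $stmt->store_result();
            if ( $stmt->num_rows > 0 ){ $highestLangCode = '-'; break; }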
            
            // OR A SPELL CHECK LIKE pspell_check...
            
        } 
    
    }
    
    return $highestLangCode !== '';
    
}
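
To get complete words rather than single letters out of preg_match_all, the \w from the question can be swapped for \p{L}+ with the u modifier, which matches runs of letters in any script. Below is a rough sketch of that idea, not a definitive implementation: it assumes the word list is already loaded into a PHP array, and the function name contains_known_word and the three-word sample list are invented for the example.

```php
<?php
// Hypothetical helper: true if any whole word of $text is in $dictionary.
function contains_known_word(string $text, array $dictionary): bool
{
    // Prefer the multibyte-aware lowercase function when mbstring is loaded.
    $lower = function_exists('mb_strtolower') ? 'mb_strtolower' : 'strtolower';

    // Build a lookup table so each word check is a cheap isset().
    $known = array_fill_keys(array_map($lower, $dictionary), true);

    // \p{L}+ matches runs of letters in any script; /u enables UTF-8 mode.
    // preg_match_all() returns 0 (no words) or false (invalid UTF-8).
    if (!preg_match_all('/\p{L}+/u', $text, $matches)) {
        return false;
    }

    foreach ($matches[0] as $word) {
        if (isset($known[$lower($word)])) {
            return true;
        }
    }
    return false;
}

$dictionary = ['allo', 'hello', 'bonjour']; // stand-in for a real word list

var_dump(contains_known_word('allo', $dictionary));     // bool(true)
var_dump(contains_known_word('zyzassk ', $dictionary)); // bool(false)
```

The array lookup keeps each check constant-time, which matters once the list holds words from many languages.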

If you have those words in a database or in a file, you can use this function:

function CheckStringForWords($text) {

    // The $words array can also be loaded from a file or a database.
    $words = array("words", "that", "you", "wanna", "check");

    // Escape each word so regex metacharacters in it are matched literally.
    $escaped = array_map(function ($word) { return preg_quote($word, '/'); }, $words);

    // implode() takes the separator first; the reversed order fails in PHP 8.
    $pattern = "/\b(" . implode("|", $escaped) . ")\b/i";

    // Returns true if $text contains any of the words in $words, false otherwise.
    return preg_match($pattern, $text) === 1;
}

If you are using PHP 8, there is a function str_contains() that determines whether a string contains a given substring:

$string = 'The lazy fox jumped over the fence';

if (str_contains($string, 'lazy')) {
    echo "The string 'lazy' was found in the string\n";
}

if (str_contains($string, 'Lazy')) {
    echo 'The string "Lazy" was found in the string';
} else {
    echo '"Lazy" was not found because the case does not match';
}

The above example will output:

The string 'lazy' was found in the string
"Lazy" was not found because the case does not match
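
Since str_contains() is case-sensitive, a case-insensitive check needs a different function. One possible sketch uses stripos(); the wrapper name contains_ignoring_case is hypothetical. For non-ASCII text, mb_stripos() from the mbstring extension is probably the safer choice. Like str_contains(), this matches substrings, not whole words.

```php
<?php
// Hypothetical wrapper: case-insensitive variant of str_contains().
function contains_ignoring_case(string $haystack, string $needle): bool
{
    // stripos() returns the match offset (possibly 0) or false, so the
    // comparison must be strict. An empty needle counts as contained,
    // which mirrors the behavior of str_contains().
    return $needle === '' || stripos($haystack, $needle) !== false;
}

$string = 'The lazy fox jumped over the fence';

var_dump(contains_ignoring_case($string, 'Lazy'));  // bool(true)
var_dump(contains_ignoring_case($string, 'quick')); // bool(false)
```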
