The question is about detecting whether a string contains any words (from any language). I'm not looking for a specific word; I just want to test whether a string contains real, existing words.
Example: $str = 'allo' would return true and $str = 'zyzassk' would return false.
I tried preg_match_all('/\w/', $input_lines, $output_array);
With \w, preg_match_all returns each individual letter, but how do I get complete words? And is there a library to test against dictionaries?
Is there a way or a PHP function to do this?
There are several ways to do it:
using str_contains: https://stackoverflow.com/a/65473395/4717133
using strpos: https://www.php.net/manual/es/function.strpos.php
The main problem is that you need to match the text against each language's vocabulary to know whether a word exists in any language. This can be torturous and is probably not worth building yourself,
but you can use a service such as Google Translate:
https://cloud.google.com/translate/docs/basic/detecting-language
HTTP method and URL:
POST https://translation.googleapis.com/language/translate/v2/detect
JSON body of the request:
{
"q": "Mi comida favorita es una enchilada."
}
To submit your request you can use curl:
curl -X POST \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
https://translation.googleapis.com/language/translate/v2/detect
You would get a response like this:
{
"data": {
"detections": [
[
{
"confidence": 1,
"isReliable": false,
"language": "es"
}
]
]
}
}
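To use that response from PHP, the JSON body has to be decoded and the first detection pulled out. The following is only a minimal sketch: parseDetection() is a hypothetical helper name, and the input is the sample response shown above, not a live API call. Keep in mind that a detected language is a heuristic and does not prove the text contains real words.

```php
<?php
// Hypothetical helper (not part of any Google client library): extracts the
// top detection from a v2 "detect" response body like the sample above.
function parseDetection(string $json): ?array {
    $decoded = json_decode($json, true);
    // "detections" holds one list per input string; each list holds candidates.
    $first = $decoded['data']['detections'][0][0] ?? null;
    if (!is_array($first) || !isset($first['language'])) {
        return null; // malformed or unexpected body
    }
    return [
        'language'   => $first['language'],
        'confidence' => (float) ($first['confidence'] ?? 0),
    ];
}

// Sample body copied from the response shown above.
$response = '{"data":{"detections":[[{"confidence":1,"isReliable":false,"language":"es"}]]}}';
$result = parseDetection($response);
```

A caller could then decide what confidence threshold, if any, counts as "contains human language" for its own data.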
I did some digging and I came up with this
function is_str_have_human_words ( $txtToDetect ) {
$highestLangCode = '';
if ( preg_match('/[\x{4E00}-\x{9FBF}]/u', $txtToDetect) ) { $highestLangCode = 'zh'; }
if ( preg_match('/[\x{3040}-\x{309F}]/u', $txtToDetect) ) { $highestLangCode = 'ja'; } // Hiragana is Japanese, not Chinese
if ( preg_match('/[\x{30A0}-\x{30FF}]/u', $txtToDetect) ) { $highestLangCode = 'ja'; } // Katakana is Japanese, not Chinese
if ( preg_match('/[\x{3130}-\x{318F}\x{AC00}-\x{D7AF}]/u', $txtToDetect) ) { $highestLangCode = 'ko'; }
if ( preg_match('/\p{Thai}/u', $txtToDetect) ) { $highestLangCode = 'th'; }
if ( preg_match('/\p{Arabic}/u', $txtToDetect) ) { $highestLangCode = 'ar'; }
if ( preg_match('/\p{Armenian}/u', $txtToDetect) ) { $highestLangCode = 'hy'; }
if ( preg_match('/\p{Bengali}/u', $txtToDetect) ) { $highestLangCode = 'bn'; }
if ( preg_match('/\p{Devanagari}/u', $txtToDetect) ) { $highestLangCode = 'hi'; }
if ( preg_match('/\p{Georgian}/u', $txtToDetect) ) { $highestLangCode = 'ka'; }
if ( preg_match('/\p{Greek}/u', $txtToDetect) ) { $highestLangCode = 'el'; }
if ( preg_match('/\p{Gujarati}/u', $txtToDetect) ) { $highestLangCode = 'gu'; }
if ( preg_match('/\p{Hebrew}/u', $txtToDetect) ) { $highestLangCode = 'he'; }
if ( preg_match('/\p{Kannada}/u', $txtToDetect) ) { $highestLangCode = 'kn'; }
if ( preg_match('/\p{Khmer}/u', $txtToDetect) ) { $highestLangCode = 'km'; }
if ( preg_match('/\p{Lao}/u', $txtToDetect) ) { $highestLangCode = 'lo'; }
if ( preg_match('/\p{Limbu}/u', $txtToDetect) ) { $highestLangCode = 'li'; }
if ( preg_match('/\p{Malayalam}/u', $txtToDetect) ) { $highestLangCode = 'ml'; }
if ( preg_match('/\p{Mongolian}/u', $txtToDetect) ) { $highestLangCode = 'mn'; }
if ( preg_match('/\p{Myanmar}/u', $txtToDetect) ) { $highestLangCode = 'my'; }
if ( preg_match('/\p{Oriya}/u', $txtToDetect) ) { $highestLangCode = 'or'; }
if ( preg_match('/\p{Sinhala}/u', $txtToDetect) ) { $highestLangCode = 'si'; }
if ( preg_match('/\p{Tagalog}/u', $txtToDetect) ) { $highestLangCode = 'tl'; }
if ( preg_match('/\p{Tamil}/u', $txtToDetect) ) { $highestLangCode = 'ta'; }
if ( preg_match('/\p{Telugu}/u', $txtToDetect) ) { $highestLangCode = 'te'; }
if ( preg_match('/\p{Thaana}/u', $txtToDetect) ) { $highestLangCode = 'dv'; }
if ( preg_match('/\p{Tibetan}/u', $txtToDetect) ) { $highestLangCode = 'bo'; }
if ( preg_match('/[А-Яа-яЁё]/u', $txtToDetect) ) { $highestLangCode = 'ru'; }
if ( $highestLangCode == '' ) {
global $mysqli; // assumes an open mysqli connection in the global scope
// strip punctuation first, then split the text on whitespace
$cleanText = preg_replace('/[[:punct:]]+/', '', strtolower($txtToDetect));
$wordsToTests = preg_split('/\s+/', $cleanText, -1, PREG_SPLIT_NO_EMPTY);
// DATABASE WITH WORDS FROM LOTS OF LANGUAGES
// a prepared statement avoids building SQL from raw input
$stmt = $mysqli->prepare("SELECT 1 FROM `langtable` WHERE `word` = ? LIMIT 1");
foreach ( $wordsToTests as $wordsToTest ) {
$stmt->bind_param('s', $wordsToTest);
$stmt->execute();
$stmt->store_result();
if ( $stmt->num_rows > 0 ){ $highestLangCode = '-'; break; }
// OR A SPELL CHECK LIKE pspell_check...
}
}
return $highestLangCode != '';
}
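For the Latin-script branch, the same idea can be tried without a database. This is only a sketch under stated assumptions: containsKnownWord() and the four-word $dictionary are hypothetical stand-ins for a real word list loaded from a file or table. It also shows how to get complete words instead of single letters, by matching \p{L}+ rather than \w.

```php
<?php
// Hypothetical word list; a real one would come from a dictionary file or table.
// array_flip() turns the words into keys so that lookups are hash lookups.
$dictionary = array_flip(['allo', 'hello', 'bonjour', 'hola']);

function containsKnownWord(string $text, array $dictionary): bool {
    // \p{L}+ matches runs of letters in any script; the u flag enables UTF-8 mode.
    if (!preg_match_all('/\p{L}+/u', $text, $matches)) {
        return false; // no letter runs at all
    }
    foreach ($matches[0] as $word) {
        // strtolower() is enough for this ASCII sample list; accented words
        // would need mb_strtolower() from the mbstring extension.
        if (isset($dictionary[strtolower($word)])) {
            return true;
        }
    }
    return false;
}
```

With this sample list, 'allo' is found and 'zyzassk' is not; the quality of the result depends entirely on how complete the word list is.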
If you have those words in a database or in a file, you can use this function:
function CheckStringForWords($text){
$words = array("words","that","you","wanna","check");
//$words array can come from a file or even a database.
// escape each word so regex metacharacters are matched literally
$escaped = array_map(function ($w) { return preg_quote($w, '/'); }, $words);
// implode() takes the separator first; the reversed order was removed in PHP 8
// preg_match() returns 1 on the first match and 0 when nothing matches
return preg_match("/\b(" . implode("|", $escaped) . ")\b/i", $text) === 1;
//returns true if $text contains any of the words in $words array, false otherwise.
}
If you are using PHP 8, there is a function str_contains() that determines whether a string contains a given substring:
$string = 'The lazy fox jumped over the fence';
if (str_contains($string, 'lazy')) {
echo "The string 'lazy' was found in the string\n";
}
if (str_contains($string, 'Lazy')) {
echo 'The string "Lazy" was found in the string';
} else {
echo '"Lazy" was not found because the case does not match';
}
The above example will output:
The string 'lazy' was found in the string
"Lazy" was not found because the case does not match
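Since str_contains() is case-sensitive, a case-insensitive check needs a different function. One option that also works before PHP 8 is stripos(); the wrapper name containsIgnoreCase() below is hypothetical and only there for illustration.

```php
<?php
// Hypothetical wrapper around stripos() for a case-insensitive substring test.
function containsIgnoreCase(string $haystack, string $needle): bool {
    // stripos() returns the match position or false; compare strictly,
    // because position 0 is a valid match.
    return stripos($haystack, $needle) !== false;
}

$string = 'The lazy fox jumped over the fence';
```

Here containsIgnoreCase($string, 'Lazy') is true even though the case differs. Like str_contains(), this only finds a substring you already know; it does not tell you whether arbitrary text contains real words.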