Generating SEO-friendly URL Slug from Urdu URL

Hi, I have a website whose URLs are all SEO-friendly. I'm now moving the site to the Urdu language, but because the titles are in Urdu, the URLs are not generated correctly. Does anyone have an SEO function I can use?

My site's URLs currently look like this: domain.com/123// when they should look like this: domain.com/123/ع وأنا لا أعرف من أين أستطيع أن أراك/

This is the code I have at the moment:

function seoUrl($input)
    {
    /** 
    * Return URL-Friendly string slug
    * @param string $input 
    * @return string 
    */
        $input = remove_accent( $input );
        $input = str_replace(" ", " ", $input);
        $input = str_replace(array("'", "-"), "", $input); //remove single quote and dash
        $input = mb_convert_case($input, MB_CASE_LOWER, "UTF-8"); //convert to lowercase
        $input = preg_replace("#[^a-zA-Z]+#", "-", $input); //replace every run of characters other than a-z and A-Z with a dash
        $input = preg_replace("#(-){2,}#", "$1", $input); //replace multiple dashes with one
        $input = trim($input, "-"); //trim dashes from beginning and end of string if any
        return $input;
    }

    function remove_accent( $str )
    {
        $a = array('À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 
                    'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 
                    'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'ÿ', 'A', 'a', 'A', 'a', 'A', 'a', 'C', 'c', 'C', 'c', 
                    'C', 'c', 'C', 'c', 'D', 'd', 'Ð', 'd', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'G', 'g', 'G', 'g', 'G', 
                    'g', 'G', 'g', 'H', 'h', 'H', 'h', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', '?', '?', 'J', 'j', 'K', 'k', 
                    'L', 'l', 'L', 'l', 'L', 'l', '?', '?', 'L', 'l', 'N', 'n', 'N', 'n', 'N', 'n', '?', 'O', 'o', 'O', 'o', 'O', 'o', 
                    'Œ', 'œ', 'R', 'r', 'R', 'r', 'R', 'r', 'S', 's', 'S', 's', 'S', 's', 'Š', 'š', 'T', 't', 'T', 't', 'T', 't', 'U', 
                    'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'W', 'w', 'Y', 'y', 'Ÿ', 'Z', 'z', 'Z', 'z', 'Ž', 'ž', '?', 
                    'ƒ', 'O', 'o', 'U', 'u', 'A', 'a', 'I', 'i', 'O', 'o', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', '?', '?', 
                    '?', '?', '?', '?');

        $b = array('A', 'A', 'A', 'A', 'A', 'A', 'AE', 'C', 'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', 'D', 'N', 'O', 'O', 'O', 'O', 'O', 
                   'O', 'U', 'U', 'U', 'U', 'Y', 's', 'a', 'a', 'a', 'a', 'a', 'a', 'ae', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i', 'n', 
                   'o', 'o', 'o', 'o', 'o', 'o', 'u', 'u', 'u', 'u', 'y', 'y', 'A', 'a', 'A', 'a', 'A', 'a', 'C', 'c', 'C', 'c', 'C', 'c', 
                   'C', 'c', 'D', 'd', 'D', 'd', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'G', 'g', 'G', 'g', 'G', 'g', 'G', 'g', 
                   'H', 'h', 'H', 'h', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'IJ', 'ij', 'J', 'j', 'K', 'k', 'L', 'l', 'L', 'l', 
                   'L', 'l', 'L', 'l', 'l', 'l', 'N', 'n', 'N', 'n', 'N', 'n', 'n', 'O', 'o', 'O', 'o', 'O', 'o', 'OE', 'oe', 'R', 'r', 'R', 
                   'r', 'R', 'r', 'S', 's', 'S', 's', 'S', 's', 'S', 's', 'T', 't', 'T', 't', 'T', 't', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 
                   'u', 'U', 'u', 'U', 'u', 'W', 'w', 'Y', 'y', 'Y', 'Z', 'z', 'Z', 'z', 'Z', 'z', 's', 'f', 'O', 'o', 'U', 'u', 'A', 'a', 
                   'I', 'i', 'O', 'o', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'A', 'a', 'AE', 'ae', 'O', 'o');

        return str_replace($a, $b, $str);
    }

The problem is what @deceze pointed out: URLs can only contain characters from the Latin alphabet (in fact, only the English alphabet), so the only way to represent Urdu in your URL would be to approximate it as best you can with English letters.

For example, I speak Catalan and, apart from accented letters, we have this letter: ç . It looks almost like a c , but it sounds like an s , so when slugging a text containing ç (for example, Març), I'd go for either Marc (visual similarity) or Mars (phonetic similarity). You could follow this pattern. Otherwise, I don't think there is anything else you can do.
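
A minimal sketch of that transliteration pattern, assuming a UTF-8 source file. The two replacement tables and the `transliterate()` helper are hypothetical names made up for this illustration; an Urdu version would need its own Urdu-to-Latin table, which is not shown here.

```php
<?php
// Hypothetical replacement tables, for illustration only.
$byShape = ['ç' => 'c', 'Ç' => 'C']; // closest-looking Latin letter
$bySound = ['ç' => 's', 'Ç' => 'S']; // closest-sounding Latin letter

// strtr() with an array replaces every key with its value in a single pass.
function transliterate(string $text, array $map): string
{
    return strtr($text, $map);
}

echo transliterate('Març', $byShape), "\n"; // Marc
echo transliterate('Març', $bySound), "\n"; // Mars
```

Which table to use is a design choice: visual similarity keeps the slug recognisable to readers, while phonetic similarity keeps it pronounceable.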

EDIT: After a crash course in URL encoding, I recommend that everyone read the comments below this answer. Non-ASCII characters can be used in a URL as long as they are percent-encoded.
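
As a rough illustration of the URL encoding mentioned in the edit above: each byte of the UTF-8 text is written as `%XX`, so the URL itself stays ASCII while still carrying the Urdu text. The sketch below uses PHP's built-in `rawurlencode()` and `rawurldecode()`; the sample word and the `domain.com/123/` path are only placeholders taken from the question.

```php
<?php
// The word "Urdu" in Urdu script, spelled with code-point escapes (PHP 7+)
// so the example does not depend on the encoding of the source file.
$segment = "\u{0627}\u{0631}\u{062F}\u{0648}";

// Percent-encode every UTF-8 byte of the path segment.
$encoded = rawurlencode($segment);
echo $encoded, "\n"; // %D8%A7%D8%B1%D8%AF%D9%88

// Decoding gives the original text back.
$decoded = rawurldecode($encoded);

// Placeholder URL built from the encoded segment.
$url = 'domain.com/123/' . $encoded . '/';
echo $url, "\n";
```

Modern browsers generally show the decoded text in the address bar, so visitors still see the Urdu title.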

I went back and read your functions in full, and I think I understand what's going on "behind the scenes":

You take your Urdu string, say the one you posted above: ع وأنا لا أعرف من أين أستطيع أن أراك

  1. You pass it to remove_accent() . The string contains none of the accented characters that function replaces, so it comes back unchanged: ع وأنا لا أعرف من أين أستطيع أن أراك .
  2. You strip single quotes and dashes. There are none here, so the string stays as is: ع وأنا لا أعرف من أين أستطيع أن أراك .
  3. You convert all the characters to lowercase. The script used here has no letter case, so nothing changes: ع وأنا لا أعرف من أين أستطيع أن أراك . And here comes the problem.
  4. You replace every run of characters other than Latin letters with a dash. Nothing in the string is a Latin letter (not even the spaces), so the whole string counts as one run and collapses to a single dash: - .
  5. You replace any group of 2 or more dashes with a single dash. Only one dash is left, so nothing changes: - .
  6. Finally, you trim that single dash, which leaves an empty string.
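
The last three steps can be reproduced in isolation with the same patterns as `seoUrl()`. This is only a sketch on a shortened sample string, and the variable names are made up for the illustration. Note that because the first pattern ends in `+`, the whole string (spaces included) is matched as one run, so it already collapses to a single dash at the first `preg_replace()`.

```php
<?php
// Shortened sample of the string from the question.
$input = 'ع وأنا لا أعرف';

// Step 4: every run of bytes outside a-z and A-Z becomes one dash.
$afterRegex = preg_replace("#[^a-zA-Z]+#", "-", $input);

// Step 5: collapse repeated dashes (nothing left to collapse here).
$afterCollapse = preg_replace("#(-){2,}#", "$1", $afterRegex);

// Step 6: trim dashes from both ends.
$afterTrim = trim($afterCollapse, "-");

var_dump($afterRegex);    // string(1) "-"
var_dump($afterCollapse); // string(1) "-"
var_dump($afterTrim);     // string(0) ""
```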

So the main problem is the first regex call. I'm not sure of the best way to fix it; one option might be to convert all those characters to ASCII and adapt the regex accordingly. However, I'd go with these steps:

  1. Turn symbols like _., !'?& into - .
  2. Collapse repeated dashes into one.
  3. Lowercase the string.
  4. Convert the string into something the browser can read. (I first thought of utf8_decode() , but I haven't tried it, and it converts UTF-8 to ISO-8859-1, which cannot represent Urdu characters; percent-encoding with rawurlencode() looks like the safer choice.)
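
The steps above could be sketched roughly as follows. This is not the asker's function, just one possible rewrite under stated assumptions: the input is valid UTF-8, the helper name `unicodeSlug()` is hypothetical, and instead of listing symbols one by one it uses the Unicode-aware pattern `[^\p{L}\p{M}\p{N}]+` with the `u` modifier, which keeps letters, combining marks, and digits from any script and turns everything else into a dash.

```php
<?php
// Hypothetical helper; assumes valid UTF-8 input.
function unicodeSlug(string $input): string
{
    // Lowercase where the script has case; fall back to strtolower()
    // if the mbstring extension is not loaded.
    $slug = function_exists('mb_strtolower')
        ? mb_strtolower($input, 'UTF-8')
        : strtolower($input);

    // Drop apostrophes, as the original seoUrl() does.
    $slug = str_replace("'", '', $slug);

    // Replace each run of anything that is not a letter, combining mark,
    // or digit with a single dash; the "u" modifier makes the pattern
    // work on code points rather than bytes.
    $slug = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '-', $slug);

    // preg_replace() returns null on failure (e.g. invalid UTF-8).
    return trim($slug ?? '', '-');
}

$title = 'ع وأنا لا أعرف';
$slug  = unicodeSlug($title);

// Percent-encode the slug when placing it in the path.
$url = 'https://domain.com/123/' . rawurlencode($slug) . '/';
echo $slug, "\n", $url, "\n";
```

Because one pattern handles both the replacement and the collapsing of repeats, the separate "multiple dashes" step from the original function is no longer needed.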

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 