How can I use sed with Unicode characters?

function change() {
  for i in {0..28}
  do
    echo ",${cryp_data_letter[$i]}" "${org_data[$i]}"
    sed -i "s/,${cryp_data_letter[$i]}/${org_data[$i]}/g" "./temp.txt"
    #cat "./temp.txt"
  done
}

I have a function that changes certain characters in temp.txt according to a specific rule, but some characters, such as ı, ğ and ö, are replaced with an empty string. I suspect the cause is UTF-8, so how can I use sed with Unicode? Or is there any other suggestion for this line: sed -i "s/,${cryp_data_letter[$i]}/${org_data[$i]}/g" "./temp.txt"

Here is the given file temp.txt:

abc ğhıi
def
jkl
oöpr
uü vy z
çgm ns
şt

and output:

IDK ,ğS,ıT
NMY
BOÜ
G,öHÇ
P,ü ÖF ,
,çUŞ ZĞ
,şV

By the way, in the reverse process I convert all letters to lower case and put "," before each letter, so before the sed the file becomes:

,a,b,c ,ğ,h,ı,i
,d,e,f
,j,k,l
,o,ö,p,r
,u,ü ,v,y ,z
,ç,g,m ,n,s
,ş,t

LOCALE:

LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=tr_TR.UTF-8
LC_TIME=tr_TR.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=tr_TR.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=tr_TR.UTF-8
LC_NAME=tr_TR.UTF-8
LC_ADDRESS=tr_TR.UTF-8
LC_TELEPHONE=tr_TR.UTF-8
LC_MEASUREMENT=tr_TR.UTF-8
LC_IDENTIFICATION=tr_TR.UTF-8
LC_ALL=

Sorry for the non-answer, but I'm unable to reproduce your issue.

Here's your code in a completely self-contained script (please do this yourself next time):

#!/bin/bash

if [[ ö != $'\xC3\xB6' ]]
then
  echo "You didn't save this file as UTF-8"
  exit 1
fi

function change() {
  for i in {0..28}
  do
#    echo ",${cryp_data_letter[$i]}" "${org_data[$i]}"
    sed -i "s/,${cryp_data_letter[$i]}/${org_data[$i]}/g" "./temp.txt"
    #cat "./temp.txt"
  done
}

# Shift all characters one letter ahead in the alphabet
cryp_data_letter=({a..z} ğ ö ı)
org_data=({b..z} ğ ö ı a)

# Create the file as you say it is before the sed
cat > temp.txt << "EOF"
,a,b,c ,ğ,h,ı,i
,d,e,f
,j,k,l
,o,ö,p,r
,u,ü ,v,y ,z
,ç,g,m ,n,s
,ş,t
EOF

change

cat temp.txt

When I run ./testscript I get this output:

bcd öiaj
efg
klm
pıqs
v,ü wz ğ
,çhn ot
,şu

As you can see, the letters including the ö and ğ are being replaced and inserted just fine.

There are multiple issues here, any of which, alone or in combination, may cause your problem.

  • We can't know which character set and encoding you use. Your locale is correctly set up for UTF-8 but your terminal and other software might not be interoperating correctly. Perhaps see also the Stack Overflow character-encoding tag info page for some background and diagnostics.
  • Even if your system and utilities are generally UTF-8 compatible, there is no guarantee that your sed is. Many sed variants are still oblivious to Unicode, and there is no stable proposal for what exactly the behavior should be. Sometimes it makes sense to switch to a different language; many trivial sed scripts can easily be ported to run under perl -CSD -p with little or no changes.
  • Even if everything else is working correctly, Unicode provides multiple ways to represent many accented characters. If your data contains ö as the single code point U+00F6 but your script contains the corresponding decomposed sequence (o followed by the combining diaeresis U+0308), or vice versa, your sed script (probably) won't replace the alternate representation. Look for Unicode normalization.
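One way to check the third point is to look at the actual bytes. The sketch below is only an illustration: hex_bytes is a made-up helper name, and the two sample strings are written as explicit byte escapes so the result does not depend on how the script file was saved.

```shell
#!/bin/bash
# Print the bytes of a string as lowercase hex, with no separators.
# (hex_bytes is a hypothetical helper, not something from the question.)
hex_bytes() {
  printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n'
}

# Two representations of "ö" that look identical on screen.
precomposed=$'\xC3\xB6'   # U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
decomposed=$'o\xCC\x88'   # "o" followed by U+0308 COMBINING DIAERESIS

echo "precomposed: $(hex_bytes "$precomposed")"   # c3b6
echo "decomposed:  $(hex_bytes "$decomposed")"    # 6fcc88
```

Running the same helper on a letter copied from temp.txt and on the matching entry of cryp_data_letter should give identical hex if the two really are the same sequence. For the first point, locale charmap should print UTF-8 in the shell that runs the script.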

With that out of the way, if the second point is sufficient, the following might actually work.

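perl -CSD -Mutf8 -pi~ -e 'tr/AEİR...FJ/ABCÇ...YZ/' ./temp.txt

Notice the -i~ option to do in-place editing but save a backup file. The -Mutf8 option tells Perl that the script text itself is UTF-8, so the letters inside tr/// are treated as characters rather than as individual bytes. I have little confidence that this will work right off the bat without some modifications and probably clarifications from your side.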
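Another way to take sed's Unicode handling out of the picture is to do the replacements in bash itself. This is only a sketch under assumptions: the short mapping arrays are placeholders rather than the real tables from the question, and change_line is a hypothetical helper name.

```shell
#!/bin/bash
# Placeholder mapping (NOT the real tables from the question).
cryp_data_letter=(a b ğ ö)
org_data=(b c ö ğ)

# Replace every ",<letter>" in one line with its mapped letter.
# The pattern is quoted so it is matched literally, not as a glob.
change_line() {
  local line=$1 i
  for i in "${!cryp_data_letter[@]}"; do
    line=${line//",${cryp_data_letter[$i]}"/${org_data[$i]}}
  done
  printf '%s\n' "$line"
}

change_line ',a,b ,ğ,ö'   # prints: bc öğ
```

Because each replaced letter loses its leading comma, a later iteration cannot match it again, which is the same property the sed loop relies on. To process the whole file, the function could be called once per line from a while IFS= read -r loop, writing to a new file instead of editing in place.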
