
Confusion regarding how strings are stored and processed in Python

I am trying to learn how strings work in Python and am having a tough time making sense of the various functionalities. Here's what I understand so far. I am hoping to get corrections and new perspectives on how to remember these nuances.

  • Firstly, I know that Unicode evolved to accommodate multiple languages and scripts across the world. But how does Python store strings? If I define s = 'hello', what is the encoding in which the string s is stored? Is it Unicode? Or does it store plain bytes? On doing type(s) I got <type 'str'>. However, when I did us = unicode(s), us was of type <type 'unicode'>. Is us a str type, or is there actually a unicode type in Python?

  • Also, I know that to save space we encode strings as bytes using the encode() function. So bs = s.encode('utf-8', errors='ignore') will return a bytes object. Now, when writing bs to a file, should I open the file in wb mode? I have seen that if the file is opened in w mode, the string is stored in the file as b"<content in s>".

  • What does the decode() function do? (I know, the question is open-ended.) Is it that we apply it to a bytes object and it transforms the string into our chosen encoding? Or does it always convert back to a Unicode sequence? Can any other insights be drawn from the following lines?

>>> s = 'hello'
>>> bobj = bytes(s, 'utf-8')
>>> bobj
'hello'
>>> type(bobj)
<type 'str'>
>>> bobj.decode('ascii')
u'hello'
>>> us = bobj.decode('ascii')
>>> type(us)
<type 'str'>
  • How does str(object) work? I read that it will try to execute the __str__() method defined on the object. But how differently does this function act on, say, Unicode strings and regular byte-encoded strings?

Thanks in advance.

Important: Python 3 behavior is described below. While Python 2 has some conceptual similarities, the observable behavior is different.

In a nutshell: because of Unicode support, the string object in Python 3 is a higher-level abstraction. It is up to the interpreter how to represent it in memory. So when it comes to serialization (e.g. writing a string's textual representation to a file), one needs to explicitly encode it to a byte sequence first, using a specified encoding (e.g. UTF-8). The same is true for the bytes-to-string conversion, i.e. decoding. In Python 2 the same behavior can be achieved using the unicode class, while str is essentially a synonym for bytes.
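As a minimal sketch of that round trip (Python 3; the file names are just for illustration). Note that in Python 3, writing a bytes object to a file opened in w mode raises a TypeError, so the b"..." content you saw was most likely the repr of the bytes object written as text:

s = 'héllo'                       # str: a sequence of Unicode code points
b = s.encode('utf-8')             # bytes: b'h\xc3\xa9llo'

# bytes must be written in binary mode...
with open('out.bin', 'wb') as f:
    f.write(b)

# ...while text mode expects str and encodes it for you:
with open('out.txt', 'w', encoding='utf-8') as f:
    f.write(s)

# decoding turns bytes back into str, given the right encoding:
assert b.decode('utf-8') == s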

While it's not a direct answer to your question, have a look at these examples:

import sys
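# Note: the exact sizes printed below were observed on a 64-bit CPython 3.x
# build; they are implementation details and may vary across versions.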

e = ''
print(len(e))            # 0
print(sys.getsizeof(e))  # 49

a = 'hello'
print(len(a))            # 5
print(sys.getsizeof(a))  # 54

u = 'hello平仮名'
print(len(u))                 # 8
print(sys.getsizeof(u))       # 90
print(len(u[1:]))             # 7
print(sys.getsizeof(u[1:]))   # 88
print(len(u[:-1]))            # 7
print(sys.getsizeof(u[:-1]))  # 88
print(len(u[:-2]))            # 6
print(sys.getsizeof(u[:-2]))  # 86
print(len(u[:-3]))            # 5
print(sys.getsizeof(u[:-3]))  # 54
print(len(u[:-4]))            # 4
print(sys.getsizeof(u[:-4]))  # 53

j = 'hello😋😋😋'
print(len(j))                 # 8
print(sys.getsizeof(j))       # 108
print(len(j[:-1]))            # 7
print(sys.getsizeof(j[:-1]))  # 104
print(len(j[:-2]))            # 6
print(sys.getsizeof(j[:-2]))  # 100

Strings are immutable in Python, and this lets the interpreter decide, at creation time, how a string will be encoded internally. Let's review the numbers from above:

  • An empty string object has an overhead of 49 bytes.
  • A string of 5 ASCII symbols has size 49 + 5, i.e. the encoding uses 1 byte per symbol.
  • A string with mixed (ASCII + non-ASCII) symbols has a higher memory footprint even though its length is still 8.
  • The difference between u and u[1:], and likewise between u and u[:-1], is 90 - 88 = 2 bytes, i.e. the encoding uses 2 bytes per symbol, even though the ASCII prefix of the string could be encoded with 1 byte per symbol. This gives us the huge advantage of constant-time indexing on strings, but we pay for it with extra memory overhead.
  • The memory footprint of string j is even higher: not all of its symbols can be encoded in 2 bytes per symbol, so the interpreter now uses 4 bytes per symbol (see the sketch after this list).
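A way to see why the interpreter lands on 2 vs 4 bytes per symbol: it picks the narrowest fixed width that can hold the largest code point in the string. A sketch of that reasoning using ord(), not any actual CPython internals:

u = 'hello平仮名'
j = 'hello😋😋😋'

print(max(ord(c) for c in u))  # 24179 -> fits in 2 bytes (<= 0xFFFF)
print(max(ord(c) for c in j))  # 128523 -> needs 4 bytes (> 0xFFFF)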

OK, let's keep checking the behavior. We already know that the interpreter stores strings with a fixed number of bytes per symbol, to give us O(1) access by index. However, we also know that UTF-8 uses a variable-length representation of symbols. Let's prove it:

j = 'hello😋😋😋'
b = j.encode('utf8')  # b'hello\xf0\x9f\x98\x8b\xf0\x9f\x98\x8b\xf0\x9f\x98\x8b'    
print(len(b))  # 17

So we can see that the first 5 characters are encoded using 1 byte per symbol, while the remaining 3 symbols are encoded using (17 - 5) / 3 = 4 bytes per symbol. This is consistent with the 4-bytes-per-symbol internal representation observed above: a code point like 😋 does not fit into 2 bytes, so both UTF-8 and the internal fixed-width encoding need 4 bytes for it.

And the other way around: when we have a sequence of bytes and decode it to a string, the interpreter decides on the internal representation (1, 2, or 4 bytes per symbol), and that decision is completely opaque to the programmer. The only thing that must be explicit is the encoding of the byte sequence: we must tell the interpreter how to interpret the bytes, and then let it decide on the internal representation of the string object.
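A small decoding sketch to make that concrete (the latin-1 and ascii calls are here only to show that the choice of encoding is ours, and that a wrong choice either produces mojibake or fails loudly):

b = 'hello😋😋😋'.encode('utf-8')

print(b.decode('utf-8'))    # correct encoding: round-trips to 'hello😋😋😋'
print(b.decode('latin-1'))  # wrong encoding: silently produces mojibake
try:
    b.decode('ascii')       # incompatible encoding: raises UnicodeDecodeError
except UnicodeDecodeError as e:
    print(e)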
