
How to map strings made up of printable characters to ints

I have downloaded a book from Project Gutenberg. For a coding project I need to map each word to a positive integer. All characters in the words are printable, but the alphabet used in this book has 75 distinct characters, punctuation included.

How can I map each word to an integer? The same word should always be mapped to the same integer but different words should be mapped to different integers.

The input is a list of words. For example:

'[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER',...

Ideally I would like to avoid reading in the whole input just to map the first word, for instance.

from itertools import count

class WordMap:
    def __init__(self):
        self._words = {}
        self._counter = count()
    
    def add(self, word):
        if word not in self._words:
            self._words[word] = next(self._counter) 
    
    def __getitem__(self, word):
        return self._words[word]
    
    def __repr__(self):
        return repr(self._words)

Demo:

>>> wm = WordMap()
>>> wm.add('Emma')
>>> wm
{'Emma': 0}
>>> wm.add('test')
>>> wm
{'Emma': 0, 'test': 1}
>>> wm.add('Emma')
>>> wm
{'Emma': 0, 'test': 1}

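Some tweaks might be in order depending on your use case. For example, store only the lowercased form of each word in self._words if you want a WordMap to be case-insensitive, or use count(1) instead of count() if the integers must be strictly positive (the counter above starts at 0).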
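
A minimal sketch of a case-insensitive variant, assuming str.casefold() is an acceptable way to normalise words. The class name CaseInsensitiveWordMap and the start parameter are made up for illustration. Here add also returns the id, so each word can be mapped as soon as it is read, and numbering starts at 1 to keep every id positive.

```python
from itertools import count


class CaseInsensitiveWordMap:
    """Hypothetical variant of WordMap that folds case before lookup."""

    def __init__(self, start=1):
        self._words = {}
        self._counter = count(start)  # start=1 keeps every id positive

    def add(self, word):
        # Normalise case so 'Emma' and 'EMMA' share one entry.
        key = word.casefold()
        if key not in self._words:
            self._words[key] = next(self._counter)
        return self._words[key]

    def __getitem__(self, word):
        return self._words[word.casefold()]


wm = CaseInsensitiveWordMap()
ids = [wm.add(w) for w in ['Emma', 'by', 'EMMA', 'Jane']]
print(ids)  # [1, 2, 1, 3]
```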

If large numbers are not a problem, I would encode each word as bytes (ASCII if your text is limited to ASCII characters, UTF-8 otherwise) and then interpret those bytes as a single integer, i.e.:

def get_code(x):
    return int.from_bytes(x.encode('ascii'), 'big')
words = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER']
for w in words:
    print(w, get_code(w))

Output:

[ 91
Emma 1164799329
by 25209
Jane 1247899237
Austen 71972703987054
1816 825766198
] 93
VOLUME 94898583063877
I 73
CHAPTER 18938268797388114
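
Since the integer is just the word's bytes read as a number, the mapping can be reversed. The sketch below assumes UTF-8 and words that do not start with a NUL character (a leading zero byte would be lost). get_word is a hypothetical helper name, not part of the answer above.

```python
def get_code(word):
    # Interpret the UTF-8 bytes of the word as one big-endian integer.
    return int.from_bytes(word.encode('utf-8'), 'big')


def get_word(code):
    # Recover the byte length from the integer, then decode the bytes.
    length = (code.bit_length() + 7) // 8
    return code.to_bytes(length, 'big').decode('utf-8')


code = get_code('Emma')
print(code)            # 1164799329
print(get_word(code))  # Emma
```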

Keep in mind that this method will not yield the smallest possible values. Another possibility, if you know all the characters in advance, is to fix an order for them and treat them as digits. Consider a simpler example: computing codes for words made up of uppercase ASCII letters. There are 26 of them, so this amounts to using a base-26 system (with digit values 1 to 26, so that no letter acts as a zero):

chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
def get_code(word):
    values = [chars.index(i)+1 for i in word]
    return sum(v*26**inx for inx, v in enumerate(values))
words = ['EMMA', 'BY', 'JANE', 'AUSTEN', 'VOLUME', 'I', 'CHAPTER']
for w in words:
    print(w, get_code(w))

Output:

EMMA 26707
BY 652
JANE 97380
AUSTEN 168989055
VOLUME 65725188
I 9
CHAPTER 5629312471

You might elect to use a different endianness (the code above treats the first letter as the least significant digit).
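
The same idea extends to any known alphabet, such as the 75 characters of the book. The following is a sketch under assumptions: make_codec is a hypothetical helper, the alphabet must contain no repeated characters, and a dict replaces chars.index to avoid a linear search per character. It also shows that the code can be turned back into the word.

```python
import string


def make_codec(alphabet):
    """Return (encode, decode) for words over `alphabet` (hypothetical helper)."""
    base = len(alphabet)
    digit = {ch: i + 1 for i, ch in enumerate(alphabet)}  # digits 1..base, never 0

    def encode(word):
        # The first character is the least significant digit.
        return sum(digit[ch] * base**i for i, ch in enumerate(word))

    def decode(code):
        chars = []
        while code > 0:
            # Subtract 1 because digit values run from 1 to base, not 0 to base-1.
            code, rem = divmod(code - 1, base)
            chars.append(alphabet[rem])
        return ''.join(chars)

    return encode, decode


encode, decode = make_codec(string.ascii_uppercase)
print(encode('EMMA'))          # 26707
print(decode(encode('EMMA')))  # EMMA
```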

Using set and enumerate:

words = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER']
{word: idx for idx, word in enumerate(set(words))}

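You will need to read in the whole input to know all the unique words in your text. Note also that a set of strings has no guaranteed order, so the integer assigned to a given word can differ from one run of the program to the next.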
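
If reading the whole input first is undesirable, a possible alternative is to hand out ids in order of first appearance while iterating. This is only a sketch: iter_ids is a hypothetical name, and numbering from 1 is an assumption made to keep the ids positive.

```python
from collections import defaultdict
from itertools import count


def iter_ids(words):
    """Yield (word, id) pairs lazily; ids follow first-appearance order."""
    # A missing key triggers the factory, which returns the next unused integer.
    ids = defaultdict(count(1).__next__)
    for word in words:
        yield word, ids[word]


stream = iter(['[', 'Emma', 'by', 'Jane', 'Emma', ']'])
result = list(iter_ids(stream))
print(result)
# [('[', 1), ('Emma', 2), ('by', 3), ('Jane', 4), ('Emma', 2), (']', 5)]
```

Because the function is a generator, the first word gets its id before the rest of the input is consumed, and the numbering is the same on every run for the same input.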

Edit

Added removal of punctuation and lowercasing the text.

from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers
import string

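# Download Emma (http://www.gutenberg.org/ebooks/158) and drop the Gutenberg header and footer
text = strip_headers(load_etext(158)).strip()
punctuation = string.punctuation + '“”'
# Lowercase, then replace every punctuation character with a space
text = text.lower().translate(str.maketrans(punctuation, ' ' * len(punctuation)))
# Sort the unique words so the numbering is the same on every run
words = sorted(set(text.split()))
{word: idx for idx, word in enumerate(words)}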

Output:

{'000': 0,
 '10': 1,
 '23rd': 2,
 '24th': 3,
 '26th': 4,
 '28th': 5,
 '7th': 6,
 '8th': 7,
 'a': 8,
 'abbey': 9,
 'abbots': 10,
 'abdy': 11,
 'abhor': 12,
 'abhorred': 13,
 'abide': 14,
 'abilities': 15,
 'able': 16,
 'abode': 17,
 'abolition': 18,
 'abominable': 19,
 'about': 20,
 ...
 'you': 7088,
 'young': 7089,
 'younger': 7090,
 'youngest': 7091,
 'your': 7092,
 'yours': 7093,
 'yourself': 7094,
 'youth': 7095,
 'youthful': 7096,
 'zeal': 7097,
 'zigzags': 7098}
