How to extract real words from a code that generates a random set of letters

I want to find out the average number of real words that show up in a set of randomly generated letters. Is there a Pythonic way to do this?

I've managed to figure out how to generate a set of 1000 random letters 1000 times, but I have no idea how to count the number of real words efficiently.

This is what I have so far

import random
import string

def text_gen(size=100, chars=string.ascii_uppercase + string.ascii_lowercase):
    # Build a string of `size` characters drawn uniformly from `chars`.
    return ''.join(random.choice(chars) for _ in range(size))

# Print 1000 random strings of 1000 letters each.
for _ in range(1000):
    print(text_gen(1000))

From the string generated, how would I be able to filter out only the parts that make sense?

You can take a different route: divide the number of words by the number of possible letter combinations.

From a dictionary file, build a set of the words of a given length, e.g. 6 letters:

with open('words.txt') as words:
    six_letters = {word for word in words.read().splitlines()
                   if len(word) == 6}

The number of six-letter words is len(six_letters).

The number of combinations of six lowercase letters is 26 ** 6.

So the probability of getting a valid six letter word is:

len(six_letters) / 26 ** 6

Edit: in Python 2, / between two integers is floor division, so the expression above evaluates to 0.

You can convert either the numerator or the denominator to a float to get a non-zero result, e.g.:

len(six_letters) / 26.0 ** 6

Or you can make your Python 2 code behave like Python 3 by importing division from __future__:

from __future__ import division

len(six_letters) / 26 ** 6

With your word list, both give:

9.67059707562e-05
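
The same ratio can be extended towards the original question. What follows is a minimal sketch, assuming letters are drawn uniformly and independently from a fixed alphabet. Under that assumption, linearity of expectation gives the expected number of word occurrences in a random string: a word of length k can start at any of n - k + 1 positions and matches at each one with probability 1 / alphabet_size ** k. The function name expected_word_count and the two-word sample list are illustrative only and do not come from the answer above.

```python
def expected_word_count(words, text_length, alphabet_size=26):
    """Expected number of word occurrences (overlaps included) in a
    uniformly random string of `text_length` letters.

    Assumes every letter is drawn independently and uniformly from an
    alphabet of `alphabet_size` symbols.
    """
    total = 0.0
    for word in set(words):  # ignore duplicate entries in the word list
        k = len(word)
        if 0 < k <= text_length:
            positions = text_length - k + 1  # possible starting indexes
            total += positions / alphabet_size ** k
    return total


# Hypothetical two-word "dictionary", used only for illustration.
sample_words = ['fire', 'phone']
print(expected_word_count(sample_words, 1000))
```

The generator in the question draws from 52 characters (upper and lower case), so passing alphabet_size=52 would model it if matching is case-sensitive.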

The number of 4-letter words is 7185. There's a nice tool for collecting histogram data in the standard library, collections.Counter:

from collections import Counter
from pprint import pprint

words_file = 'words.txt'  # assumed path to your word list

with open(words_file) as words:
    # Count how many words there are of each length.
    counter = Counter(len(word.strip()) for word in words)

pprint(sorted(counter.items()))

The values from your file give:

[(1, 26),
 (2, 427),
 (3, 2130),
 (4, 7185),
 (5, 15918),
 (6, 29874),
 (7, 41997),
 (8, 51626),
 (9, 53402),
 (10, 45872),
 (11, 37538),
 (12, 29126),
 (13, 20944),
 (14, 14148),
 (15, 8846),
 (16, 5182),
 (17, 2967),
 (18, 1471),
 (19, 760),
 (20, 359),
 (21, 168),
 (22, 74),
 (23, 31),
 (24, 12),
 (25, 8),
 (27, 3),
 (28, 2),
 (29, 2),
 (31, 1)]

So the most common word length in your dictionary is 9 letters, with 53402 words. There are roughly twice as many 5-letter words as 4-letter words, and roughly twice as many 6-letter words as 5-letter words.
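
The histogram can be combined with the match probability to see which lengths matter for a random string. This is a rough sketch, assuming 1000 lowercase letters drawn uniformly and that every counted word is distinct and lowercase. The name expected_by_length is made up for this example, and only the first six rows of the histogram are copied in.

```python
# Word counts per length, copied from the first rows of the histogram above.
length_counts = {1: 26, 2: 427, 3: 2130, 4: 7185, 5: 15918, 6: 29874}


def expected_by_length(length_counts, text_length=1000, alphabet_size=26):
    """Expected number of word hits per word length in a random string.

    Each of the `count` words of a given length can start at
    text_length - length + 1 positions and matches at each position
    with probability 1 / alphabet_size ** length.
    """
    result = {}
    for length, count in sorted(length_counts.items()):
        positions = max(text_length - length + 1, 0)
        result[length] = count * positions / alphabet_size ** length
    return result


for length, expected in expected_by_length(length_counts).items():
    print(length, round(expected, 3))
```

Although the dictionary has more 6-letter than 4-letter words, each extra letter divides the chance of a match at a given position by 26, so the short words dominate the expected total.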

It is up to you to define what counts as a real word, so create your own list of words. I wrote the following solution using the string from your comment as the random string:

dictionary = ['fire', 'phone']
random_string = 'gdlkfghiwmfefirekjfewlklphonelkfdlfk'
total_words = 0
for word in dictionary:
    total_words += random_string.count(word)
print(total_words)

>>> 2

Which can be refactored into the following code where you create a list with the count of each word in your dictionary and then get a sum of all these counts:

dictionary = ['fire', 'phone']
random_string = 'gdlkfghiwmfefirekjfewlklphonelkfdlfk'
total_words = sum([random_string.count(word) for word in dictionary]) # List comprehension to create a list, then sum the content of the list
print(total_words)

>>> 2
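
For a large dictionary, calling count once per word gets slow, and str.count only counts non-overlapping matches. One possible alternative, sketched here under the assumption that the words fit in a set, is to slide a window over the text for each word length and test membership in the set. The helper names count_words and average_word_count, and the seed parameter, are hypothetical and not part of the answer above.

```python
import random
import string


def count_words(text, words):
    """Count every substring of `text` that is in `words` (overlaps included)."""
    words = set(words)
    lengths = {len(word) for word in words if word}  # skip empty strings
    total = 0
    for length in lengths:
        # Slide a window of this length across the text.
        for start in range(len(text) - length + 1):
            if text[start:start + length] in words:
                total += 1
    return total


def average_word_count(words, trials=1000, size=1000,
                       chars=string.ascii_lowercase, seed=None):
    """Average number of word hits over `trials` random strings."""
    rng = random.Random(seed)  # a seed makes the run repeatable
    total = 0
    for _ in range(trials):
        text = ''.join(rng.choice(chars) for _ in range(size))
        total += count_words(text, words)
    return total / trials


print(count_words('gdlkfghiwmfefirekjfewlklphonelkfdlfk', ['fire', 'phone']))  # 2
print(average_word_count(['fire', 'phone'], trials=100, seed=0))
```

The cost per string grows with the number of distinct word lengths rather than the number of words, which matters once the dictionary holds hundreds of thousands of entries.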

You could also check each candidate word with a request to https://developer.oxforddictionaries.com/. They have an API which may be useful for your purposes, and they also have a basic Python example using requests. Alternatively, you could use another API, for example the Google Translate API, and check for error responses. (I have not used either of them personally and do not know what they return for a misspelled word, but it should not be hard to find out.)

Last but not least, you could use requests and Beautiful Soup to send requests to a dictionary page and parse the results. (Google Translate would be the best target, but it will block you after a few requests.)
