
How to validate a word in Python?

I have a list in Python like this:

words_list = ['thatCreation', 'happeningso', 'comebecause']

Question:

I want to split each string into its component words:

For example:
    'thatCreation' -> 'that', 'creation'
    'happeningso'  -> 'happening', 'so'
    'comebecause'  -> 'come', 'because'

Thanks in advance for solving it in Python.

It looks like you are trying to take words that were merged together in camel case and break them apart. There is a great algorithm, Viterbi segmentation, that does this really well.

I can't explain the magic behind it, but I implemented it in my program recently and it works really well. My understanding is that it computes the probability of each candidate word and splits the text at the boundaries that maximize that probability. The algorithm can split words regardless of case.

import re
from collections import Counter

# Relative frequency of `word` in the training dictionary.
def word_prob(word): return dictionary[word] / total

# All lowercase alphabetic tokens in `text`.
def words(text): return re.findall('[a-z]+', text.lower())

dictionary = Counter(words(open(words_path).read()))  # words_path: path to your word-list file
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))

def viterbi_segment(text):
    # probs[i] = probability of the best segmentation of text[:i];
    # lasts[i] = start index of the last word in that segmentation.
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    # Walk backwards through `lasts` to recover the best segmentation.
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

sentence = ' '.join(viterbi_segment('thatCreation'.lower())[0])
print('sentence: {0}'.format(sentence))
word = ''.join(a.capitalize() for a in re.split('([^a-zA-Z0-9])', sentence)
               if a.isalnum())
print('word: {0}'.format(word[0].lower() + word[1:]))
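
With a dictionary that covers these words, this should print:

sentence: that creation
word: thatCreation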

You need a dictionary with a ton of words; there are multiple out there, but I used: https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-no-swears.txt

and updated it with new words that it didn't have.
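
A minimal sketch of loading that word list and topping it up with missing words (the filename and the extra words here are just placeholders):

import re
from collections import Counter

def words(text): return re.findall('[a-z]+', text.lower())

# Build the frequency table from the downloaded list, then add missing words.
dictionary = Counter(words(open('google-10000-english-no-swears.txt').read()))
dictionary.update(['creation', 'viterbi'])  # Counter.update adds one count per item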

Borrowed from Peter Norvig's pytudes to perform word segmentation. Please try:

import re
from collections import Counter
from typing import List, Tuple

!wget https://raw.githubusercontent.com/dwyl/english-words/master/words.txt
# Note: the code below reads 'big.txt' (https://norvig.com/big.txt) as its corpus,
# so download that file as well.

Word = str    # We implement words as strings
cat = ''.join # Function to concatenate strings together


def tokens(text) -> List[Word]:
    """List all the word tokens (consecutive letters) in a text. Normalize to lowercase."""
    return re.findall('[a-z]+', text.lower()) 

TEXT = open('big.txt').read()
WORDS = tokens(TEXT)


class ProbabilityFunction:
    def __call__(self, outcome):
        """The probability of `outcome`."""
        if not hasattr(self, 'total'):
            self.total = sum(self.values())
        return self[outcome] / self.total
    
class Bag(Counter, ProbabilityFunction): """A bag of words."""
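
A Bag is simultaneously a Counter and a probability function. A quick sanity check (toy data, not from the original answer):

toy = Bag(tokens('the cat sat on the mat'))
print(toy('the'))  # 2 of the 6 tokens -> 0.333...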
    

Pword = Bag(WORDS)


def Pwords(words: List[Word]) -> float:
    "Probability of a sequence of words, assuming each word is independent of others."
    return Π(Pword(w) for w in words)

def Π(nums) -> float:
    "Multiply the numbers together.  (Like `sum`, but with multiplication.)"
    result = 1
    for num in nums:
        result *= num
    return result
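
(On Python 3.8+, the standard-library math.prod computes the same product.)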

def splits(text, start=0, end=20) -> List[Tuple[str, str]]:
    """Return a list of all (first, rest) pairs; start <= len(first) <= end."""
    return [(text[:i], text[i:])
            for i in range(start, min(len(text), end)+1)]
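
For instance, splits('word', 1) returns [('w', 'ord'), ('wo', 'rd'), ('wor', 'd'), ('word', '')].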

def segment(text) -> List[Word]:
    """Return a list of words that is the most probable segmentation of text."""
    if not text: 
        return []
    else:
        candidates = ([first] + segment(rest)
                      for (first, rest) in splits(text, 1))
        return max(candidates, key=Pwords)

strings = ['thatCreation', 'happeningso', 'comebecause']
[segment(string.lower()) for string in strings]

--2020-08-04 18:48:06--  https://raw.githubusercontent.com/dwyl/english-words/master/words.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4863005 (4.6M) [text/plain]
Saving to: 'words.txt.2'

words.txt.2         100%[===================>]   4.64M   162KB/s    in 25s

2020-08-04 18:48:31 (192 KB/s) - 'words.txt.2' saved [4863005/4863005]

[['that', 'creation'], ['happening', 'so'], ['come', 'because']]

import re
from collections import Counter

def viterbi_segment(text):
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

def word_prob(word): return dictionary[word] / total
def words(text): return re.findall('[a-z]+', text.lower())   
dictionary = Counter(words(open('big.txt').read()))
max_word_length = max(map(len, dictionary))  
total = float(sum(dictionary.values()))
l = ['thatCreation', 'happeningso', 'comebecause',]

for w in l:
    print(viterbi_segment(w.lower()))

The output will be:
(['that', 'creation'], 1.63869514118246e-07)
(['happening', 'so'], 1.1607123777400279e-07)
(['come', 'because'], 4.81658105705814e-07)

I got the solution to my problem from @Darius Bacon; for this, you need to convert all the strings to lowercase. Thank you guys for your help.

Visit this link to download big.txt: https://norvig.com/big.txt
