
Check the percentage of numeric content in a word - Python

I want to check the percentage of numeric content in a particular string. For example,

words = ['p2', 'p23', 'pp34', 'ppp01932', 'boss']

For input like that, the output should be:

output 
0.5
0.67
0.5
0.625
0.0

The reasoning behind the output: for 'p2', the number of numeric characters is 1 and the total length is 2, therefore 0.5. Likewise, I want to find the output for all the entries.

I have tried the following,

float(sum(c.isdigit() for c in words[i])) / float(len(words[i]))

This works fine, but it is very inefficient, and when I run it using pyspark I get errors such as JVM errors. I am looking for an efficient way to compute this so that I can run it at scale, on a dataset of ~2 billion records.
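
For context, this is roughly how I am calling it from pyspark (a minimal sketch; the DataFrame and its word column are illustrative, not my actual job):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('p2',), ('ppp01932',), ('boss',)], ['word'])

# Per-row Python UDF: every record gets serialized out to a Python worker,
# which is where the cost shows up at this scale.
digit_ratio = F.udf(lambda w: sum(c.isdigit() for c in w) / float(len(w)), DoubleType())
df = df.withColumn('digit_ratio', digit_ratio('word'))
df.show()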

Any help would be appreciated.

Thanks

This worked for me. You should use a regular expression in Python: just import re. Because re is written in C, its speed is very good.

import re

for word in words:
    # count the digit characters matched by the regex, divide by total length
    print(len(re.findall(r'\d', word)) / len(word))

re.findall(r'\d', word) finds all the digit characters in each of your list's elements, and len() gives you the count. Based on the results, if you have 1000 words with length ~100 or more, the regex seems like the best way for you.
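
Note (my addition, not something benchmarked here): if you are calling this over a huge number of strings, precompiling the pattern once avoids the per-call lookup in re's internal pattern cache:

import re

DIGIT = re.compile(r'\d')  # compile once, reuse for every word

def digit_ratio(word):
    return len(DIGIT.findall(word)) / len(word)

print(digit_ratio('ppp01932'))  # 0.625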

"Inefficient" is something you test for, not guess at. I ran several variations on this ( isdigit() , re.sub() , etc.) and only 2 things were faster than your code: getting rid of the unnecessary float() , and not using the i index.

E.g.:

import timeit

words = ['p2', 'p23','pp34','ppp01932','boss']

def isdigsub():
    for i in range(len(words)):
        float(sum(c.isdigit() for c in words[i])) / float(len(words[i]))

def isdigsub2():
    for i in range(len(words)):
        sum(c.isdigit() for c in words[i]) / len(words[i])

def isdigsub3():
    for w in words:
        sum(c.isdigit() for c in w) / len(w)

def isdigsub4():
    # From user Hamms
    for w in words:
        len([c for c in w if c.isdigit()]) / len(w)

if __name__ == '__main__':

    print(timeit.timeit('isdigsub()', setup="from __main__ import isdigsub", number=10000))
    print(timeit.timeit('isdigsub2()', setup="from __main__ import isdigsub2", number=10000))
    print(timeit.timeit('isdigsub3()', setup="from __main__ import isdigsub3", number=10000))
    print(timeit.timeit('isdigsub4()', setup="from __main__ import isdigsub4", number=10000))

On a pokey old Cubox, this produced:

0.7179876668378711
0.5230729999020696
0.4444526666775346
0.3233160013332963

Aaaand Hamms is in the lead with the best time so far. Barkeep! List comprehensions for everyone!

So many interesting approaches have been proposed here, and based on some fiddling around it looks like the relative times of each can fluctuate quite a bit depending on the lengths of the words being considered.

Let's grab some of the proposed solutions to test:

import re


def original(words):
    [sum(c.isdigit() for c in word) / float(len(word)) for word in words]


def filtered_list_comprehension(words):
    [len([c for c in word if c.isdigit()]) / len(word) for word in words]


def regex(words):
    [len("".join(re.findall(r"\d", word))) / float(len(word)) for word in words]


def native_filter(words):
    # list() is needed on Python 3, where filter() returns a lazy iterator
    [len(list(filter(str.isdigit, word))) / float(len(word)) for word in words]


def native_filter_with_map(words):
    # list() forces evaluation; on Python 3 a bare map() would never run
    list(map(lambda word: len(list(filter(str.isdigit, word))) / float(len(word)), words))
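
The timing harness isn't shown in this post; a rough reconstruction (assuming the functions above are in scope; the word generator and the number=100 repeat count are my guesses, not the author's actual setup) could look like this:

import random
import string
import timeit

def make_words(n, length):
    # random mix of lowercase letters and digits, mimicking inputs like 'ppp01932'
    alphabet = string.ascii_lowercase + string.digits
    return [''.join(random.choice(alphabet) for _ in range(length)) for _ in range(n)]

for length in (10, 20, 30, 50, 100, 500):
    print('Testing with 1000 words of length %d:' % length)
    test_words = make_words(1000, length)
    for fn in (original, filtered_list_comprehension, regex,
               native_filter, native_filter_with_map):
        elapsed = timeit.timeit(lambda: fn(test_words), number=100)
        print('%28s: %10.3f' % (fn.__name__, elapsed))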

And test them each with varying word lengths. Times are in seconds. Testing with 1000 words of length 10:

                    original:       1.976
 filtered_list_comprehension:       1.224
                       regex:       2.575
               native_filter:       1.209
      native_filter_with_map:       1.264

Testing with 1000 words of length 20:

                    original:       3.044
 filtered_list_comprehension:       2.032
                       regex:       3.205
               native_filter:       1.947
      native_filter_with_map:       2.034

Testing with 1000 words of length 30:

                    original:       4.115
 filtered_list_comprehension:       2.819
                       regex:       3.889
               native_filter:       2.708
      native_filter_with_map:       2.734

Testing with 1000 words of length 50:

                    original:       6.294
 filtered_list_comprehension:       4.313
                       regex:       4.884
               native_filter:       4.134
      native_filter_with_map:       4.171

Testing with 1000 words of length 100:

                    original:       11.638
 filtered_list_comprehension:       8.130
                       regex:       7.756
               native_filter:       7.858
      native_filter_with_map:       7.790

Testing with 1000 words of length 500:

                    original:       55.100
 filtered_list_comprehension:       38.052
                       regex:       28.049
               native_filter:       37.196
      native_filter_with_map:       37.209

From this I would conclude that if your "words" being tested can be up to 500 characters or so long, a regex seems to work well. Otherwise, filtering with str.isdigit seems to be the best approach across a variety of lengths.

Your code actually didn't work for me. This seems equivalent though, maybe it'll help.

words = ['p2', 'p23', 'pp34', 'ppp01932', 'boss']
# the inner map flags each character as digit or not; the list() calls make this Python 3-safe
print(list(map(lambda v: sum(v) / len(v),
               map(lambda v: list(map(lambda u: u.isdigit(), v)), words))))
## [0.5, 0.6666666666666666, 0.5, 0.625, 0.0]

Try this:

words = ['p2', 'p23', 'pp34', 'ppp01932', 'boss']

def get_digits(string):
    # count the digit characters in the string
    c = 0
    for ch in string:
        if ch.isdigit():
            c += 1
    return c

for item in words:
    print(round(get_digits(item) / len(item), 2))

Note: this has been adapted from Benjamin Wohlwend's answer to this question.

Hint: you can speed up your code by replacing builtin lookups with local name lookups.

This is the fastest solution for me:

def count(len=len):
    # len=len binds the builtin to a local name when the function is defined
    for word in words:
        len([c for c in word if c.isdigit()]) / len(word)

This is basically Hamms's filtered_list_comprehension / Peter's isdigsub4 with the len=len optimization.

With this trick, the bytecode uses LOAD_FAST instead of LOAD_GLOBAL to resolve len. This gave me a 3.6% speedup. Not much, but better than nothing.
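
You can see the difference with the standard dis module (a minimal sketch; the two function names are made up for illustration):

import dis

def global_len(word):
    return len(word)  # len is resolved at call time

def local_len(word, len=len):
    return len(word)  # len was bound as a default argument, so it is a local

dis.dis(global_len)  # disassembly shows LOAD_GLOBAL for len
dis.dis(local_len)   # disassembly shows LOAD_FAST for len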
