EDIT (thanks to Jezrael to lead me on a way to do it):
I have a list of phone numbers, and I'd like for every number to count the number of occurences of each digit. You'd understand that a double for
loop isn't possible as I have a few hundred thousands phone numbers to compute on.
For that i create a dataframe (because I feel it might be an effective way to do it) which contains on each row the split of the phone number:
numbers = ['0615260518','0815465948','0215616235','0415291826']
df = pd.DataFrame([list(number) for number in numbers])
Out[1]:
0 1 2 3 4 5 6 7 8 9
0 0 6 1 5 2 6 0 5 1 8
1 0 8 1 5 4 6 5 9 4 8
2 0 2 1 5 6 1 6 2 3 5
3 0 4 1 5 2 9 1 8 2 6
Then I need to fill it with the desired counts. Desired output:
Out[2]:
0 1 2 3 4 5 6 7 8 9
0615260518 2 2 1 0 0 2 2 0 1 0
0815465948 1 1 0 0 2 2 1 0 2 1
0215616235 1 2 2 1 0 2 2 0 0 0
0415291826 1 2 2 0 1 1 1 0 1 1
Here is a bloody not pythonic way to achieve it:
for num in df.index:
for col in df.columns:
df.ix[num,col] = num.count(str(col))
Way too long to compute, then Jezrael offered me this solution:
df.apply(lambda x: x.value_counts(), axis=1).fillna(0).astype(int)
Which is way better, but still too long. So I've tried to replace the value_counts which isn't designed for small dataframes:
df.apply(lambda x: digit_count(''.join(x)), axis=1)
Where:
def digit_count(number):
my_string = list(number.lower())
my_dict = []
for i in np.arange(10):
my_dict.append(my_string.count(str(i)))
return my_dict
Which is about 3 times faster. But is there any way to do it even faster (I feel this iteration isn't optimal).
You can first convert index
to_series
, because apply
does not work with index
yet. Last apply
value_counts
, fillna
and cast to int
by astype
:
a = (df.index.to_series().apply(lambda x: pd.Series(list(x))))
print (a)
0 1 2 3 4 5 6 7 8 9
0615260518 0 6 1 5 2 6 0 5 1 8
0815465948 0 8 1 5 4 6 5 9 4 8
0215616235 0 2 1 5 6 1 6 2 3 5
0415291826 0 4 1 5 2 9 1 8 2 6
print (a.apply(lambda x: x.value_counts(), axis=1).fillna(0).astype(int))
0 1 2 3 4 5 6 8 9
0615260518 2 2 1 0 0 2 2 1 0
0815465948 1 1 0 0 2 2 1 2 1
0215616235 1 2 2 1 0 2 2 0 0
0415291826 1 2 2 0 1 1 1 1 1
EDIT:
from collections import Counter
print (pd.DataFrame([x for x in a.apply(Counter, axis=1)]))
0 1 2 3 4 5 6 8 9
0 2 2 1.0 NaN NaN 2 2 1.0 NaN
1 1 1 NaN NaN 2.0 2 1 2.0 1.0
2 1 2 2.0 1.0 NaN 2 2 NaN NaN
3 1 2 2.0 NaN 1.0 1 1 1.0 1.0
Timings ( len(df)=4
):
In [288]: %timeit (a.apply(lambda x: x.value_counts(), axis=1))
100 loops, best of 3: 3.74 ms per loop
In [289]: %timeit (pd.DataFrame([x for x in a.apply(Counter, axis=1)]))
1000 loops, best of 3: 1.27 ms per loop
( len(df)=4k
):
In [296]: %timeit (pd.DataFrame([x for x in a.apply(Counter, axis=1)]))
10 loops, best of 3: 87 ms per loop
In [297]: %timeit (a.apply(lambda x: x.value_counts(), axis=1))
1 loop, best of 3: 2.45 s per loop
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.