I have a pd.Series of strings, each consisting of exactly two parts separated by '_'.
For instance,
s = pd.Series(['a_1', 'a_2', 'a_3', 'b_1'])
The command s.str.split("_")
will return a Series of lists:
0 ['a', '1']
1 ['a', '2']
2 ['a', '3']
3 ['b', '1']
The command s.str.partition("_", expand=False)
will return a Series of tuples, where '_'
is the second element of each tuple:
0 ('a', '_', '1')
1 ('a', '_', '2')
2 ('a', '_', '3')
3 ('b', '_', '1')
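If the partition output is already at hand, one option (a sketch; `map` carries the same Python-level overhead as `apply`) is to slice the separator out of each 3-tuple:

```python
import pandas as pd

s = pd.Series(['a_1', 'a_2', 'a_3', 'b_1'])

# Partition into (head, sep, tail) tuples, then drop the middle
# element with a step-2 slice: t[::2] == (t[0], t[2]).
parts = s.str.partition('_', expand=False)
pairs = parts.map(lambda t: t[::2])
```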
Is there a clean (and fast) way to create a Series of tuples without '_'
in them:
0 ('a', '1')
1 ('a', '2')
2 ('a', '3')
3 ('b', '1')
I can always do s.str.split("_").apply(tuple)
, but apply is generally slower than the built-in vectorized string methods (like str.split
...)
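For reference, a minimal runnable version of that apply approach on the sample data:

```python
import pandas as pd

s = pd.Series(['a_1', 'a_2', 'a_3', 'b_1'])

# str.split produces a Series of lists; apply(tuple)
# converts each list to a tuple, element-wise.
tuples = s.str.split('_').apply(tuple)
```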
One idea is to use a list comprehension:
s = pd.Series('a_1, a_2, a_3, b_1'.split(', '))
# 4k rows
s = pd.concat([s] * 1000, ignore_index=True)
In [195]: %timeit s.str.split("_").apply(tuple)
2.49 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [196]: %timeit [tuple(x.split('_')) for x in s]
1.46 ms ± 79.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [197]: %timeit pd.Index(s).str.split("_", expand=True).tolist()
4.31 ms ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
s = pd.Series('a_1, a_2, a_3, b_1'.split(', '))
# 400k rows
s = pd.concat([s] * 100000, ignore_index=True)
In [199]: %timeit s.str.split("_").apply(tuple)
252 ms ± 4.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [200]: %timeit [tuple(x.split('_')) for x in s]
180 ms ± 370 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [201]: %timeit pd.Index(s).str.split("_", expand=True).tolist()
379 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)