Pandas: Convert a list of string tuples to a DataFrame faster?

Question

From a text field I have the following input series, which contains geographic coordinate tuples as strings:

import pandas as pd

coords = pd.Series([
   '(29.65271977700047, -82.33086252299967)',
   '(29.652914019000434, -82.42682220199964)',
   '(29.65301114200048, -82.36455186899968)',
   '(29.642610841000476, -82.29853169599966)',
])

I would like to parse the numbers in these tuples and end up with the following result DataFrame:

         lat        lon
0  29.652720 -82.330863
1  29.652914 -82.426822
2  29.653011 -82.364552
3  29.642611 -82.298532

This is what I have come up with:

str_coords = coords.str[1:-1].str.split(', ')
latlon = str_coords.apply(pd.Series).astype(float)
latlon.columns = ['lat', 'lon']

My problem: the call to .apply(pd.Series) takes "forever" on the real list, which has around 1.2 million entries. Is there a faster way?

Answer 1

Another way is to access the first and second elements of each split list through the str accessor as well:

In [174]: coords = pd.Series([
   .....:    '(29.65271977700047, -82.33086252299967)',
   .....:    '(29.652914019000434, -82.42682220199964)',
   .....:    '(29.65301114200048, -82.36455186899968)',
   .....:    '(29.642610841000476, -82.29853169599966)'])

In [175]: str_coords = coords.str[1:-1].str.split(', ')

In [176]: coords_df = pd.DataFrame({'lat': str_coords.str[0], 'lon': str_coords.str[1]})

In [177]: coords_df.astype(float).head()
Out[177]:
         lat        lon
0  29.652720 -82.330863
1  29.652914 -82.426822
2  29.653011 -82.364552
3  29.642611 -82.298532
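A related variant, not part of the answer above: str.split also accepts expand=True, which returns the pieces as DataFrame columns directly, so the two .str[0] / .str[1] lookups are not needed. This is a minimal sketch that assumes a pandas version where expand=True is available and that every entry splits into exactly two parts.

```python
import pandas as pd

coords = pd.Series([
    '(29.65271977700047, -82.33086252299967)',
    '(29.652914019000434, -82.42682220199964)',
    '(29.65301114200048, -82.36455186899968)',
    '(29.642610841000476, -82.29853169599966)',
])

# Strip the surrounding parentheses, then split every string into
# two columns in one vectorised call and convert them to floats.
latlon = coords.str[1:-1].str.split(', ', expand=True).astype(float)
latlon.columns = ['lat', 'lon']

print(latlon)
```

Whether this is faster than the dict-of-columns construction on 1.2 million rows is not something the timings in this answer cover, so it is worth measuring on the real data.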

Some timings indicate that both my solution and that of @ajcr are much faster than the apply(pd.Series) approach (and the difference between the two is negligible):

In [197]: coords = pd.Series([
   .....:    '(29.65271977700047, -82.33086252299967)',
   .....:    '(29.652914019000434, -82.42682220199964)',
   .....:    '(29.65301114200048, -82.36455186899968)',
   .....:    '(29.642610841000476, -82.29853169599966)'])

In [198]: coords = pd.concat([coords]*1000, ignore_index=True)


In [199]: %%timeit
   .....: str_coords = coords.str[1:-1].str.split(', ')
   .....: df_coords = pd.DataFrame({'lat': str_coords.str[0], 'lon': str_coords.str[1]}, dtype=float)
   .....:
100 loops, best of 3: 14.1 ms per loop

In [200]: %%timeit
   .....: str_coords = coords.str[1:-1].str.split(', ')
   .....: df_coords = str_coords.apply(pd.Series).astype(float)
   .....:
1 loops, best of 3: 821 ms per loop

In [201]: %%timeit
   .....: df_coords = coords.str.extract(r'\((?P<lat>[\d\.]+),\s+(?P<lon>[^()\s,]+)\)')
   .....: df_coords.astype(float)
   .....:
100 loops, best of 3: 16.2 ms per loop
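To repeat this comparison outside IPython, the three approaches can be wrapped in functions and timed with the standard-library timeit module. This is a rough sketch: the function names are made up for illustration, .astype(float) is used for every conversion, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

base = pd.Series([
    '(29.65271977700047, -82.33086252299967)',
    '(29.652914019000434, -82.42682220199964)',
    '(29.65301114200048, -82.36455186899968)',
    '(29.642610841000476, -82.29853169599966)',
])
# Same 4000-row test series as in the timings above.
coords = pd.concat([base] * 1000, ignore_index=True)


def via_str_index():
    # Split once, then pick list elements through the str accessor.
    parts = coords.str[1:-1].str.split(', ')
    return pd.DataFrame({'lat': parts.str[0], 'lon': parts.str[1]}).astype(float)


def via_apply():
    # The approach from the question: one pd.Series is built per row.
    parts = coords.str[1:-1].str.split(', ')
    out = parts.apply(pd.Series).astype(float)
    out.columns = ['lat', 'lon']
    return out


def via_extract():
    # Named capture groups become the column names.
    pattern = r'\((?P<lat>[\-\d\.]+),\s+(?P<lon>[\-\d\.]+)\)'
    return coords.str.extract(pattern).astype(float)


for fn in (via_str_index, via_apply, via_extract):
    # Best of three single runs, reported in milliseconds.
    seconds = min(timeit.repeat(fn, number=1, repeat=3))
    print(f'{fn.__name__}: {seconds * 1000:.1f} ms')
```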

Answer 2

Another way could be to use the vectorised string method extract:

>>> df = coords.str.extract(r'\((?P<lat>[\-\d\.]+),\s+(?P<lon>[\-\d\.]+)\)')
>>> df
                  lat                 lon
0   29.65271977700047  -82.33086252299967
1  29.652914019000434  -82.42682220199964
2   29.65301114200048  -82.36455186899968
3  29.642610841000476  -82.29853169599966

You can pass named regex capture groups to extract; it will create a DataFrame with the group names as column names.

You can then cast this DataFrame df to a float datatype:

>>> df.astype(float)
         lat        lon
0  29.652720 -82.330863
1  29.652914 -82.426822
2  29.653011 -82.364552
3  29.642611 -82.298532
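One property of extract that may matter on 1.2 million real-world rows: entries that do not match the pattern come back as NaN instead of raising, so they can be counted after the cast. A small sketch, where the third entry is a made-up malformed value that is not in the original data:

```python
import pandas as pd

coords = pd.Series([
    '(29.65271977700047, -82.33086252299967)',
    '(29.652914019000434, -82.42682220199964)',
    'not a coordinate',  # hypothetical malformed entry for illustration
])

# Same pattern as above: the group names become the column names.
pattern = r'\((?P<lat>[\-\d\.]+),\s+(?P<lon>[\-\d\.]+)\)'
df = coords.str.extract(pattern)

# Rows that did not match hold NaN, which survives the cast to float.
latlon = df.astype(float)
bad_rows = latlon.isna().any(axis=1)

print(latlon)
print(int(bad_rows.sum()), 'row(s) failed to parse')
```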
