Improving the speed of extracting information from a pandas column

Question

I have a dataframe with around 200,000 datapoints and a column which looks like this (example for 1 datapoint):

'{"id":342,"name":"Web","slug":"technology/web","position":15,"parent_id":16,"color":6526716,"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/technology/web"}}}'

I want to extract information about the name and slug. I did the following:

df["cat"], df["slug"] = np.nan, np.nan

for i in range(0, len(df.category)):
    df["cat"][i] = df.category.iloc[i].split('"name":"')[1].split('"')[0]
    df["slug"][i] = df.category.iloc[i].split('"name":"')[1].split('"')[4]

This works perfectly fine, but it takes around 4 hours. Is there any way to make this faster?

Answer 1

Instead of manipulating a DataFrame directly, try using simple data types and creating the DataFrame in one go. Another solution other than jezrael's:

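import json

import pandas as pd

cat, slug = [], []

# Parse each JSON string once and collect the two fields in plain lists
for row in df.category:
    d = json.loads(row)
    cat.append(d['name'])  # the JSON key is "name", not "cat"
    slug.append(d['slug'])

# Build the DataFrame in one go from the collected lists
df = pd.DataFrame({'cat': cat, 'slug': slug})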
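
Note that the last line above rebinds df to a frame holding only the two new columns. If the other columns need to be kept, one possible variant is to assign the parsed values back onto the existing frame. This is only a sketch: the two-row frame is made up for illustration and parsed is a hypothetical helper name.

```python
import json

import pandas as pd

# Hypothetical two-row frame standing in for the real data
df = pd.DataFrame({
    "category": [
        '{"id":342,"name":"Web","slug":"technology/web"}',
        '{"id":1,"name":"Art","slug":"art"}',
    ]
})

# Parse every JSON string exactly once
parsed = [json.loads(row) for row in df["category"]]

# Assign whole columns at once instead of writing cell by cell
df["cat"] = [d["name"] for d in parsed]
df["slug"] = [d["slug"] for d in parsed]
```

Assigning a plain list is positional, so this keeps the original index and the category column untouched.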

Answer 2

You can do it very efficiently with extract and regular expressions:

df['cat'] = df['category'].str.extract('"name":"([^"]+)"')
df['slug'] = df['category'].str.extract('"slug":"([^"]+)"')

df
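
As a small sketch of how extract behaves on rows that do not match, the example below uses expand=False so that each call returns a Series, and includes a made-up second row that has neither a name nor a slug. The sample strings are assumptions, not real data from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "category": [
        '{"id":342,"name":"Web","slug":"technology/web","position":15}',
        '{"id":7,"position":3}',  # hypothetical row with no name or slug
    ]
})

# With a single capture group, expand=False returns a Series
df["cat"] = df["category"].str.extract(r'"name":"([^"]+)"', expand=False)
df["slug"] = df["category"].str.extract(r'"slug":"([^"]+)"', expand=False)

# Rows where the pattern does not match end up as missing values
print(df[["cat", "slug"]])
```

One caveat with the regex route: the pattern [^"]+ stops at the first double quote, so a value containing an escaped quote would be cut short, whereas json.loads would decode it correctly.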

The question was about improving speed, so here's the performance comparison (tested on a 100,000 rows sample; see note below):

%%timeit

df['cat'] = df['category'].str.extract('"name":"([^"]+)"')
df['slug'] = df['category'].str.extract('"slug":"([^"]+)"')

309 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit

cat, slug = [], []
for row in df.category:
    d = json.loads(row)
    cat.append(d['name'])
    slug.append(d['slug'])

df1 = pd.DataFrame({'cat': cat, 'slug': slug})

574 ms ± 6.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit

df1 = pd.DataFrame([ast.literal_eval(x) for x in df['category']],
                   index=df.index)[['name','slug']]

5.1 s ± 29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note: sample generated with:

x = '{"id":342,"name":"Web","slug":"technology/web","position":15,"parent_id":16,"color":6526716,"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/technology/web"}}}'
df = pd.DataFrame({'category': [x]*100000})
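
The %%timeit cells above only run inside IPython or Jupyter. For a plain Python script, a rough equivalent could use the standard timeit module, as sketched below. The function names with_extract and with_json are hypothetical, the sample is cut to 1,000 rows to keep the run short, and the timings will differ from the figures quoted above.

```python
import json
import timeit

import pandas as pd

x = '{"id":342,"name":"Web","slug":"technology/web","position":15,"parent_id":16,"color":6526716,"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/technology/web"}}}'
df = pd.DataFrame({"category": [x] * 1000})  # smaller than the 100,000 rows used above


def with_extract(frame):
    """Regex approach: one str.extract call per field."""
    return pd.DataFrame({
        "cat": frame["category"].str.extract(r'"name":"([^"]+)"', expand=False),
        "slug": frame["category"].str.extract(r'"slug":"([^"]+)"', expand=False),
    })


def with_json(frame):
    """JSON approach: parse each string once, then build the frame."""
    parsed = [json.loads(row) for row in frame["category"]]
    return pd.DataFrame(
        {"cat": [d["name"] for d in parsed], "slug": [d["slug"] for d in parsed]},
        index=frame.index,
    )


# Check that both approaches agree before comparing their timings
a, b = with_extract(df), with_json(df)
assert a["cat"].tolist() == b["cat"].tolist()
assert a["slug"].tolist() == b["slug"].tolist()

for fn in (with_extract, with_json):
    seconds = timeit.timeit(lambda: fn(df), number=3) / 3
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per run")
```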

Answer 1
1, accepted, 2019-03-26 09:28:46

Answer 2
1, 2019-03-26 10:47:41
