
Improve speed of extracting information from pandas columns

I have a dataframe with around 200,000 data points and a column which looks like this (example for one data point):

'{"id":342,"name":"Web","slug":"technology/web","position":15,"parent_id":16,"color":6526716,"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/technology/web"}}}'

I want to extract the name and slug from each entry. I did the following:

df["cat"], df["slug"] = np.nan, np.nan

for i in range(0, len(df.category)):
    df["cat"][i] = df.category.iloc[i].split('"name":"')[1].split('"')[0]
    df["slug"][i] = df.category.iloc[i].split('"name":"')[1].split('"')[4]

This works perfectly fine, but it takes around 4 hours. Is there any way to make this faster?

Instead of manipulating a DataFrame directly, try using simple data types and create the DataFrame in one go. Another solution, other than jezrael's:

import json

cat, slug = [], []

for row in df.category:
    d = json.loads(row)
    cat.append(d['name'])  # the JSON key is "name"; there is no "cat" key
    slug.append(d['slug'])

df = pd.DataFrame({'cat': cat, 'slug': slug})
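Note that the snippet above replaces df with a new two-column frame. If the other columns are still needed, the same idea can be sketched by assigning the parsed values back as new columns. This is a minimal sketch, assuming every row is valid JSON containing both keys; the two-row sample frame is hypothetical and only mimics the shape of the question's data.

```python
import json

import pandas as pd

# Hypothetical two-row sample in the same shape as the question's column.
df = pd.DataFrame({
    "category": [
        '{"id":342,"name":"Web","slug":"technology/web","position":15}',
        '{"id":1,"name":"Art","slug":"art","position":1}',
    ]
})

# Parse each JSON string once, then pull both fields from the parsed dicts.
parsed = [json.loads(row) for row in df["category"]]

# Assigning whole lists adds the columns in one step and keeps the
# existing columns, instead of writing cell by cell in a loop.
df["cat"] = [d["name"] for d in parsed]
df["slug"] = [d["slug"] for d in parsed]
```

Building plain Python lists first and assigning them once avoids the per-row DataFrame writes that make the original loop slow.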

You can do it very efficiently with str.extract and regular expressions:

df['cat'] = df['category'].str.extract('"name":"([^"]+)"')
df['slug'] = df['category'].str.extract('"slug":"([^"]+)"')

df
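The two extract calls can possibly be merged into a single pass with named capture groups, since str.extract returns one column per group. This is only a sketch: it assumes "slug" always follows "name" directly in the string, as in the example row, and the sample frame below is hypothetical. If the key order can vary, the two separate calls above are the safer choice.

```python
import pandas as pd

# Hypothetical two-row sample in the same shape as the question's column.
df = pd.DataFrame({
    "category": [
        '{"id":342,"name":"Web","slug":"technology/web","position":15}',
        '{"id":1,"name":"Art","slug":"art","position":1}',
    ]
})

# Named groups become the column names of the extracted frame.
# Assumption: "name" is immediately followed by "slug" in every row.
pattern = r'"name":"(?P<cat>[^"]+)","slug":"(?P<slug>[^"]+)"'
extracted = df["category"].str.extract(pattern)

# Attach the two new columns to the original frame by index.
df = df.join(extracted)
```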

The question was about improving speed, so here's the performance comparison (tested on a 100,000-row sample; see note below):

%%timeit

df['cat'] = df['category'].str.extract('"name":"([^"]+)"')
df['slug'] = df['category'].str.extract('"slug":"([^"]+)"')

309 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit

cat, slug = [], []
for row in df.category:
    d = json.loads(row)
    cat.append(d['name'])
    slug.append(d['slug'])

df1 = pd.DataFrame({'cat': cat, 'slug': slug})

574 ms ± 6.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit

df1 = pd.DataFrame([ast.literal_eval(x) for x in df['category']],
                   index=df.index)[['name','slug']]

5.1 s ± 29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note: the sample was generated with:

x = '{"id":342,"name":"Web","slug":"technology/web","position":15,"parent_id":16,"color":6526716,"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/technology/web"}}}'
df = pd.DataFrame({'category': [x]*100000})
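Outside IPython, where %%timeit is not available, a comparison along these lines could be reproduced with the standard timeit module. This is a rough sketch, not the setup used for the numbers above: the helper names with_extract and with_json are made up for illustration, the sample string is shortened, and the frame has only 1,000 rows so it runs quickly. It also checks that both approaches return the same values.

```python
import json
import timeit

import pandas as pd

# Shortened sample row; 1,000 rows instead of the 100,000 used above.
x = '{"id":342,"name":"Web","slug":"technology/web","position":15,"parent_id":16}'
df = pd.DataFrame({"category": [x] * 1000})


def with_extract():
    # expand=False returns a Series for a single capture group.
    cat = df["category"].str.extract(r'"name":"([^"]+)"', expand=False)
    slug = df["category"].str.extract(r'"slug":"([^"]+)"', expand=False)
    return cat.tolist(), slug.tolist()


def with_json():
    parsed = [json.loads(row) for row in df["category"]]
    return [d["name"] for d in parsed], [d["slug"] for d in parsed]


# Both approaches should yield identical values before timing them.
same_result = with_extract() == with_json()

# Total seconds for 5 calls of each function.
t_extract = timeit.timeit(with_extract, number=5)
t_json = timeit.timeit(with_json, number=5)
print(same_result, t_extract, t_json)
```

Absolute timings will differ by machine and pandas version, so only the relative ordering is worth comparing.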
