Apply json.loads for a column of dataframe with dask
I have a dataframe fulldb_accrep_united of this kind:
SparkID ... Period
0 913955 ... {"@PeriodName": "2000", "@DateBegin": "2000-01...
1 913955 ... {"@PeriodName": "1999", "@DateBegin": "1999-01...
2 16768 ... {"@PeriodName": "2007", "@DateBegin": "2007-01...
3 16768 ... {"@PeriodName": "2006", "@DateBegin": "2006-01...
4 16768 ... {"@PeriodName": "2005", "@DateBegin": "2005-01...
I need to convert the Period column, which is currently a column of strings, into a column of JSON values. Usually I do this with df.apply(lambda x: json.loads(x)), but this dataframe is too large to process as a whole. I want to use dask, but I seem to be missing something important. I think I don't understand how to use apply in dask, and I can't find a solution.
The code

This is how I would do it using pandas with the whole df in memory:
#%% read df
os.chdir('/opt/data/.../download finance/output')
fulldb_accrep_united = pd.read_csv('fulldb_accrep_first_download_raw_quotes_corrected.csv', index_col = 0, encoding = 'utf-8')
os.chdir('..')
#%% Deleting some freaky symbols from column
condition = fulldb_accrep_united['Period'].str.contains('\\xa0', na = False, regex = False)
fulldb_accrep_united.loc[condition.values, 'Period'] = fulldb_accrep_united.loc[condition.values, 'Period'].str.replace('\\xa0', ' ', regex = False).values
#%% Convert to json
fulldb_accrep_united.loc[fulldb_accrep_united['Period'].notnull(), 'Period'] = fulldb_accrep_united['Period'].dropna().apply(lambda x: json.loads(x))
This is the code where I try to use dask:
#%% load data with dask
os.chdir('/opt/data/.../download finance/output')
fulldb_accrep_united = dd.read_csv('fulldb_accrep_first_download_raw_quotes_corrected.csv', encoding = 'utf-8', blocksize = 16 * 1024 * 1024) #16Mb chunks
os.chdir('..')
#%% setup calculation graph. No work is done here.
def transform_to_json(df):
    condition = df['Period'].str.contains('\\xa0', na = False, regex = False)
    df['Period'] = df['Period'].mask(condition.values, df['Period'][condition.values].str.replace('\\xa0', ' ', regex = False).values)
    condition2 = df['Period'].notnull()
    df['Period'] = df['Period'].mask(condition2.values, df['Period'].dropna().apply(lambda x: json.loads(x)).values)
result = transform_to_json(fulldb_accrep_united)
The last cell here gives an error:
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure
What am I doing wrong? I have tried to find similar topics for almost 5 hours, but I think I am missing something important, because I am new to the topic.
Your question was long enough that I didn't read through all of it. My apologies. See https://stackoverflow.com/help/minimal-reproducible-example
However, based on the title, it may be that you want to apply the json.loads function across every element in a dataframe's column:
df["column-name"] = df["column-name"].apply(json.loads)