How can I read a Python dict contained in a CSV file and store the data in a pandas DataFrame?
I have a CSV file where each row is a dictionary. Inside each row's dict there is a list, and this list contains a sublist and a subdict. Each sublist has 2 elements, and the subdict has 100 keys, with one value per key.
Here's a sample of the data in text format:
{"0": [[10.8, 36.0], {"0": 0, "1": 0, "2": 0, "3": 0, "4": 0, "5": 0, "6": 0, "7": 0, "8": 0, "9": 0, "10": 0}]}
{"1": [[10.8, 36.1], {"0": 0, "1": 0, "2": 0, "3": 0, "4": 0, "5": 0, "6": 0, "7": 0, "8": 0, "9": 0, "10": 0}]}
{"2": [[10.8, 36.2], {"0": 0, "1": 0, "2": 0, "3": 0, "4": 0, "5": 0, "6": 0, "7": 0, "8": 0, "9": 0, "10": 0}]}
{"3": [[10.8, 36.300000000000004], {"0": 0, "1": 0, "2": 0, "3": 0, "4": 0, "5": 0, "6": 0, "7": 0, "8": 0, "9": 0, "10": 0}]}
{"4": [[10.8, 36.4], {"0": 0, "1": 0, "2": 0, "3": 0, "4": 0, "5": 0, "6": 0, "7": 0, "8": 0, "9": 0, "10": 0}]}
{"5": [[10.8, 36.5], {"0": 0, "1": 0, "2": 0, "3": 0, "4": 0, "5": 0, "6": 0, "7": 0, "8": 0, "9": 0, "10": 0}]}
{"6": [[10.8, 36.6], {"0": 0, "1": 0, "2": 0, "3": 0, "4": 0, "5": 0, "6": 0, "7": 0, "8": 0, "9": 0, "10": 0}]}
{"7": [[10.8, 36.7], {"0": 0, "1": 0, "2": 0, "3": 0, "4": 0, "5": 0, "6": 0, "7": 0, "8": 0, "9": 0, "10": 0}]}
{"8": [[10.8, 36.800000000000004], {"0": 0, "1": 0, "2": 0, "3": 0, "4": 0, "5": 0, "6": 0, "7": 0, "8": 0, "9": 0}]}
{"9": [[10.8, 36.9], {"0": 0, "1": 0, "2": 0, "3": 0, "4": 0, "5": 0, "6": 0, "7": 0, "8": 0, "9": 0}]}
What I would like to do is read this into a pandas DataFrame that would produce output like this (I will type just a single row for the sake of simplicity):
list_elemnt_1 list_elemnt_2 key_0, key_1, key_2, key_3, key_4, and so on...
value value value value value value value and so on...
For each row in the CSV, I would like to build a DataFrame with one column per sublist value (2) and one column for each key in the subdict contained in the row's dict.
How could I do this? Please feel free to ask for more information if needed.
Thank you very much in advance.
EDIT
key_0, key_1, key_2, etc. are the subdict keys, not the master dict keys.
import ast
import pandas as pd

# Parse each non-empty line of the file into a Python dict.
with open('file_55966371.csv', 'r') as file:
    lines = [ast.literal_eval(line) for line in file if line.strip()]

def clean_lines(line):
    # Each line holds a single master key; its value is [sublist, subdict].
    value = [v for v in line.values()]
    l1, l2 = value[0][0]
    line_dict = value[0][1]
    # Prefix the subdict keys so the columns read key_0, key_1, ...
    line_dict = {f'key_{key}': value for key, value in line_dict.items()}
    line_dict['list_element1'] = l1
    line_dict['list_element2'] = l2
    return line_dict

to_read = [clean_lines(line) for line in lines]
df = pd.DataFrame(to_read)
I agree with @furas: this looks a lot like JSON, and if this data came from someone else, it would be best to ask whether they could send it to you in JSON format.
If not, the code above works.
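Because every line in the sample also happens to be valid JSON (double-quoted keys, no Python-only literals), the standard-library json module could parse it in place of ast.literal_eval. This is only a minimal sketch, assuming that holds for the whole file; the shortened sample lines and the helper name parse_line are made up for illustration:

```python
import io
import json

import pandas as pd

# Two illustrative lines in the same shape as the sample data (subdict shortened).
sample = io.StringIO(
    '{"0": [[10.8, 36.0], {"0": 0, "1": 0, "2": 0}]}\n'
    '{"1": [[10.8, 36.1], {"0": 0, "1": 0, "2": 0}]}\n'
)


def parse_line(line):
    """Flatten one line into a single dict (hypothetical helper)."""
    record = json.loads(line)
    # Each line holds exactly one master key; its value is [sublist, subdict].
    sublist, subdict = next(iter(record.values()))
    flat = {"list_element1": sublist[0], "list_element2": sublist[1]}
    flat.update({f"key_{k}": v for k, v in subdict.items()})
    return flat


json_df = pd.DataFrame([parse_line(line) for line in sample if line.strip()])
print(json_df)
```

To try it on the real file, replace the io.StringIO object with an open file handle.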
Open the file.
Read each line and store the results as a list. ast.literal_eval allows Python to recognize each line as a dictionary from the get-go and store it as a dict object.
I created a helper function, clean_lines. clean_lines is the more important part:
  a. take the value stored under the line's single master key
  b. unpack the sublist into l1 and l2
  c. rename the subdict keys with a key_ prefix (key_X)
  d. add l1 and l2 as entries in the dictionary, basically combining the sublist and subdict into a single dictionary
Once you have a list of dictionaries, pandas will be able to recognize it and you can pass it to the pd.DataFrame constructor.
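One detail worth checking: in the sample data the last two rows have no "10" key in the subdict. When the dictionaries in the list have differing keys, pd.DataFrame uses the union of the keys as columns and fills the gaps with NaN. A small sketch of that behaviour with made-up, shortened records in the shape clean_lines returns; whether 0 is the right fill value is an assumption to verify against your data:

```python
import pandas as pd

# Flattened records in the shape clean_lines returns; the second one lacks key_2.
records = [
    {"key_0": 0, "key_1": 0, "key_2": 0, "list_element1": 10.8, "list_element2": 36.7},
    {"key_0": 0, "key_1": 0, "list_element1": 10.8, "list_element2": 36.9},
]

frame = pd.DataFrame(records)
missing = frame["key_2"].isna()  # True only for the record without key_2

# Optionally replace the gaps and turn the column back into integers.
filled = frame.fillna({"key_2": 0}).astype({"key_2": int})
print(filled)
```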
Not the best way to do it.
import ast
import pandas as pd

# Edit for reading the csv
# There are two ways to go about it; I am assuming the data is in one column.
# Read in the csv; I assume you are able to do this.
df_csv = pd.read_csv('/path/to/your/file/filename.csv')
column_name = df_csv.columns[0]  # assumption: first column; insert the actual column name here

list_of_dfs = []
for idx, row in df_csv.iterrows():
    # The cell is read as a string, so parse it into a dict first.
    d = ast.literal_eval(row[column_name])
    # One-row frame indexed by the master key; its columns hold the raw sublist and subdict.
    df = pd.DataFrame.from_dict(d, orient='index')
    remove_cols = df.columns
    for i in d.keys():
        df['list_elemnt_1'] = d[i][0][0]
        df['list_elemnt_2'] = d[i][0][1]
        for key in d[i][1].keys():
            df[key] = d[i][1][key]
    # Remove the original cols here.
    df = df.drop(columns=remove_cols)
    list_of_dfs.append(df)
This will give you the DataFrame for each line as an element of list_of_dfs, which I assume is the goal? Let me know if it works.
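If a single table is wanted rather than a list of one-row frames, the frames can be stacked with pd.concat. A sketch, assuming each element of list_of_dfs is a one-row DataFrame indexed by the master key; the two frames here are hand-made stand-ins with a shortened subdict:

```python
import pandas as pd

# Hand-made stand-ins for the per-row frames collected in list_of_dfs.
list_of_dfs = [
    pd.DataFrame(
        {"list_elemnt_1": [10.8], "list_elemnt_2": [36.0], "0": [0], "1": [0]},
        index=["0"],
    ),
    pd.DataFrame(
        {"list_elemnt_1": [10.8], "list_elemnt_2": [36.1], "0": [0], "1": [0]},
        index=["1"],
    ),
]

# Stack the rows; the master key of each line stays as the index.
combined = pd.concat(list_of_dfs)
print(combined)
```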