Parsing a JSON string enclosed with quotation marks from a CSV using Pandas

Similar to this question, but my CSV has a slightly different format. Here is an example:

id,employee,details,createdAt  
1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"  
2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"

I think the double quotation mark at the beginning of the JSON column might have caused some errors. Using df = pandas.read_csv('file.csv'), this is the dataframe that I got:

id  employee                details    createdAt              Unnamed: 1  Unnamed: 2 
 1      John        {Country":"USA"  Salary:5000           Review:null}"  2018-09-01 
 2     Sarah  {Country":"Australia"  Salary:6000  Review:"Hardworking"}"  2018-09-05
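
To illustrate why the row falls apart this way, here is a minimal sketch using Python's standard-library csv module as a stand-in. This is an assumption for illustration only: pandas uses its own parser, although the dataframe above shows the same pattern. The variable names are hypothetical.

```python
import csv

# One data row copied from the example. The JSON column is wrapped in quotes,
# but its inner quotes are not doubled, so this is not valid CSV quoting.
line = '1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"'

# csv.reader accepts any iterable of lines, so a one-element list is enough.
fields = next(csv.reader([line]))

# Print each parsed field on its own line to see where the row was cut.
for field in fields:
    print(repr(field))
```

The first inner quote ends the quoted state early, so every comma inside the JSON is then treated as a delimiter and the row comes out as six fields instead of four.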

My desired output:

id  employee                                                       details   createdAt
 1      John                 {"Country":"USA","Salary":5000,"Review":null}  2018-09-01 
 2     Sarah  {"Country":"Australia","Salary":6000,"Review":"Hardworking"}  2018-09-05 

I've tried adding quotechar='"' as a parameter, and it still doesn't give me the result that I want. Is there a way to tell pandas to ignore the first and the last quotation mark surrounding the JSON value?

I have reproduced your file with:

import pandas as pd

df = pd.read_csv('e1.csv', index_col=None)
print(df)

Output:

     id    emp                                            details      createdat
0   1   john    "{"Country":"USA","Salary":5000,"Review":null}"  "2018-09-01" 
1   2  sarah  "{"Country":"Australia", "Salary":6000,"Review...   "2018-09-05"

I think there's a better way by passing a regex such as sep=r',"|",|(?<=\d),' together with some other combination of parameters, but I haven't fully figured it out.

Here is a less than optimal option:

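import pandas as pd

# Read every line as one field by using a separator that never matches.
df = pd.read_csv('s083838383.csv', sep='@#$%^', engine='python')
header = df.columns[0]  # the whole header line becomes the only column name
print(df)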

Why sep='@#$%^'? This is just garbage that lets you read the file without splitting it on any separator. It could be any string that never occurs in the data; it is only a means of importing each line as a single column of a df object to work with.

df looks like this:

                       id,employee,details,createdAt
0  1,John,"{"Country":"USA","Salary":5000,"Review...
1  2,Sarah,"{"Country":"Australia", "Salary":6000...
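
To try the never-matching separator without a file on disk, here is a small sketch that feeds the sample rows through io.StringIO. The in-memory buffer and the variable name raw are assumptions for illustration; the read_csv call is the same as above.

```python
import io
import pandas as pd

# Sample rows copied from the question; io.StringIO stands in for the file.
raw = io.StringIO(
    'id,employee,details,createdAt\n'
    '1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"\n'
    '2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"\n'
)

# With engine='python', a separator longer than one character is treated as a
# regular expression. This one never matches, so no line is split.
df = pd.read_csv(raw, sep='@#$%^', engine='python')
header = df.columns[0]

print(df.shape)
print(header)
```

Each line ends up as one string in a single column whose name is the entire header line, which is what the later steps rely on.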

Then you could use str.extract to apply a regex and expand the columns:

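# Capture the two plain fields, the quoted JSON object, and the date.
# Stripping column by column avoids DataFrame.applymap, which newer pandas
# releases deprecate.
result = df[header].str.extract(r'(.+),(.+),("\{.+\}"),(.+)',
                                expand=True).apply(lambda col: col.str.strip())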

result.columns = header.strip().split(',')
print(result)

result is:

  id employee                                            details     createdAt
0  1     John    "{"Country":"USA","Salary":5000,"Review":null}"  "2018-09-01"
1  2    Sarah  "{"Country":"Australia", "Salary":6000,"Review...  "2018-09-05"

If you need the starting and ending quotes stripped from the details string values, you could do:

result['details'] = result['details'].str.strip('"')

If the details items need to be dicts instead of strings, you could do:

from json import loads
result['details'] = result['details'].apply(loads)
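
Putting the extract, strip, and loads steps together, here is a self-contained sketch on the sample rows. It reads the lines from an in-memory buffer rather than a file, strips the quotes from createdAt as well, and expands the dicts with pd.json_normalize; those three choices are my additions for illustration and not part of the steps above.

```python
import io
import json
import pandas as pd

# Sample rows copied from the question; io.StringIO stands in for the file.
raw = io.StringIO(
    'id,employee,details,createdAt\n'
    '1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"\n'
    '2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"\n'
)
lines = raw.read().splitlines()
header, rows = lines[0], pd.Series(lines[1:])

# Same pattern as above: two plain fields, the quoted JSON object, the date.
result = rows.str.extract(r'(.+),(.+),("\{.+\}"),(.+)', expand=True)
result.columns = header.split(',')

# Drop the wrapping quotes, then decode the JSON text into dicts.
result['details'] = result['details'].str.strip('"').apply(json.loads)
result['createdAt'] = result['createdAt'].str.strip('"')

# Optionally expand the dicts into one column per JSON key.
expanded = pd.json_normalize(result['details'].tolist())
print(expanded)
```

Note that id stays a string here; convert it with astype(int) if you need numbers.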

As an alternative approach, you could read the file in manually, parse each row correctly, and use the resulting data to construct the dataframe. This works by splitting each row both forwards and backwards to get the non-problematic columns and then taking the remaining part:

import pandas as pd

data = []

with open("e1.csv") as f_input:
    for row in f_input:
        row = row.strip()
        split = row.split(',', 2)
        rsplit = [cell.strip('"') for cell in split[-1].rsplit(',', 1)]
        data.append(split[0:2] + rsplit)

df = pd.DataFrame(data[1:], columns=data[0])
print(df)

This would display your data as:

  id employee                                            details   createdAt
0  1     John      {"Country":"USA","Salary":5000,"Review":null}  2018-09-01
1  2    Sarah  {"Country":"Australia", "Salary":6000,"Review"...  2018-09-05
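
To make the forward and backward split easier to follow, here is a trace of the same two steps on a single row. The variable names head, tail, and record are only for illustration.

```python
# One data row copied from the example file.
row = '1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"'

# Forward split: at most two cuts, so id and employee come off the front
# and the third piece still holds the JSON plus the date.
head = row.split(',', 2)

# Backward split: one cut from the right separates the date from the JSON.
# strip('"') then removes the wrapping quotes from both pieces.
tail = [cell.strip('"') for cell in head[2].rsplit(',', 1)]

record = head[:2] + tail
print(record)
```

This relies on the first two columns and the last column never containing a comma themselves; only the JSON column may.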
