Parsing a JSON string enclosed with quotation marks from a CSV using Pandas
Similar to this question, but my CSV has a slightly different format. Here is an example:
id,employee,details,createdAt
1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"
2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"
I think the double quotation mark at the beginning of the JSON column might have caused some errors. With
df = pandas.read_csv('file.csv')
this is the dataframe that I got:
id employee details createdAt Unnamed: 1 Unnamed: 2
1 John {Country":"USA" Salary:5000 Review:null}" 2018-09-01
2 Sarah {Country":"Australia" Salary:6000 Review:"Hardworking"}" 2018-09-05
My desired output:
id employee details createdAt
1 John {"Country":"USA","Salary":5000,"Review":null} 2018-09-01
2 Sarah {"Country":"Australia","Salary":6000,"Review":"Hardworking"} 2018-09-05
I've tried adding quotechar='"' as a parameter, but it still doesn't give me the result I want. Is there a way to tell pandas to ignore the first and last quotation marks surrounding the JSON value?
I have reproduced your file with
import pandas as pd
df = pd.read_csv('e1.csv', index_col=None)
print(df)
Output:
id emp details createdat
0 1 john "{"Country":"USA","Salary":5000,"Review":null}" "2018-09-01"
1 2 sarah "{"Country":"Australia", "Salary":6000,"Review... "2018-09-05"
I think there's a better way: passing a regex to sep=r',"|",|(?<=\d),' and possibly some other combination of parameters. I haven't figured it out completely.
Here is a less-than-optimal option:
df = pd.read_csv('s083838383.csv', sep='@#$%^', engine='python')
header = df.columns[0]
print(df)
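Why sep='@#$%^'? It is just a garbage string that does not occur in the file, so pandas finds no separator and reads each line into a single column. Any string that never appears in the data would do; it is only a means of getting the raw lines into a df object to work with. Because the separator is longer than one character, pandas treats it as a regular expression, which requires the Python parsing engine, hence engine='python'.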
df looks like this:
id,employee,details,createdAt
0 1,John,"{"Country":"USA","Salary":5000,"Review...
1 2,Sarah,"{"Country":"Australia", "Salary":6000...
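The single-column read can be tried without a file on disk. The following is a minimal sketch, assuming the two sample rows from the question are held in an in-memory buffer; the names raw and buffer are illustrative and not part of the original answer.

```python
import io
import pandas as pd

# Sample rows copied from the question, kept in memory instead of a file.
raw = '''id,employee,details,createdAt
1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"
2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"
'''
buffer = io.StringIO(raw)

# The multi-character separator is treated as a regex that never matches
# these lines, so every line lands in one column.
df = pd.read_csv(buffer, sep='@#$%^', engine='python')

# The whole header line becomes the name of that single column.
header = df.columns[0]
print(df.shape)
print(header)
```

The header line is kept as the column name, which is why the later steps split it on commas to recover the real column names.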
Then you could use str.extract to apply a regex and expand the columns:
# Note: applymap is deprecated since pandas 2.1; DataFrame.map replaces it there.
result = df[header].str.extract(r'(.+),(.+),("\{.+\}"),(.+)',
                                expand=True).applymap(str.strip)
result.columns = header.strip().split(',')
print(result)
result is:
id employee details createdAt
0 1 John "{"Country":"USA","Salary":5000,"Review":null}" "2018-09-01"
1 2 Sarah "{"Country":"Australia", "Salary":6000,"Review... "2018-09-05"
If you need the starting and ending quotes stripped off the details string values, you could do:
result['details'] = result['details'].str.strip('"')
If the details items need to be dicts instead of strings, you could do:
from json import loads
result['details'] = result['details'].apply(loads)
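Put together, the extract, strip and decode steps might look like the sketch below. It assumes the sample rows from the question are held in memory, and the names raw, lines and parsed are illustrative. It also relies on the first two fields containing no commas and on the JSON column being the only field that starts with "{.

```python
import json
import pandas as pd

# Sample rows copied from the question; held in memory so no file is needed.
raw = '''id,employee,details,createdAt
1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"
2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"
'''

header, *rows = raw.strip().splitlines()
lines = pd.Series(rows)

# Two plain fields, the quoted JSON blob, then the date.
parsed = lines.str.extract(r'(.+),(.+),("\{.+\}"),(.+)', expand=True)
parsed.columns = header.split(',')

# Drop the wrapping quotes, then decode the JSON text into dicts.
parsed['details'] = parsed['details'].str.strip('"').apply(json.loads)
parsed['createdAt'] = parsed['createdAt'].str.strip('"')

print(parsed.loc[0, 'details']['Country'])
```

After decoding, JSON null becomes Python None, so the Review value for the first row is None rather than the string "null".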
As an alternative approach, you could read the file in manually, parse each row correctly and use the resulting data to construct the dataframe. This works by splitting each row both forwards and backwards to get the non-problematic columns and then taking the remaining part:
import pandas as pd
data = []
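with open("e1.csv") as f_input:
    for row in f_input:
        row = row.strip()
        # Split off the first two columns from the left
        split = row.split(',', 2)
        # Split off the last column from the right; the JSON is what remains
        rsplit = [cell.strip('"') for cell in split[-1].rsplit(',', 1)]
        data.append(split[0:2] + rsplit)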
df = pd.DataFrame(data[1:], columns=data[0])
print(df)
This would display your data as:
id employee details createdAt
0 1 John {"Country":"USA","Salary":5000,"Review":null} 2018-09-01
1 2 Sarah {"Country":"Australia", "Salary":6000,"Review"... 2018-09-05
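The same splitting idea can be wrapped in a function and combined with JSON decoding. This is only a sketch: parse_rows is a hypothetical helper name, the sample is read from an in-memory buffer rather than e1.csv, and it assumes that the first two columns and the last column never contain commas.

```python
import io
import json
import pandas as pd

def parse_rows(lines):
    """Hypothetical helper: split each line from both ends, keep the middle as JSON text."""
    data = []
    for row in lines:
        row = row.strip()
        if not row:
            continue  # skip blank lines
        head = row.split(',', 2)                  # id, employee, rest
        tail = [cell.strip('"') for cell in head[-1].rsplit(',', 1)]  # details, createdAt
        data.append(head[:2] + tail)
    # The first parsed row is the header.
    return pd.DataFrame(data[1:], columns=data[0])

sample = io.StringIO('''id,employee,details,createdAt
1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"
2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"
''')

df = parse_rows(sample)
# Optional: turn the JSON text into dicts.
df['details'] = df['details'].apply(json.loads)
print(df)
```

Because the JSON column is whatever remains between the two splits, commas inside it do not matter.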