Parsing JSON lines using pandas

I have a large JSON Lines file with millions of lines. The file also contains some error messages. Below is a sample:

{"MEASUREMENT_1":"12345678","MEASUREMENT_2":"123456789012","MEASUREMENT_3":"MEASUREMENT_TYPE","MEASUREMENT_4":1111111111111,"MEASUREMENT_5":-1122,"MEASUREMENT_6":-2233,"MEASUREMENT_7":"123456789"}
{"MEASUREMENT_1":"87654321","MEASUREMENT_2":"987654321098","MEASUREMENT_3":"MEASUREMENT_TYPE_2","MEASUREMENT_4":222222222222,"MEASUREMENT_5":-4455,"MEASUREMENT_6":-6677,"MEASUREMENT_7":"123456789"}
[2015-12-02 02:00:02,530] WARN Reconnect due to socket error: null 
[2015-12-02 02:00:02,633] WARN Reconnect due to socket error: null 

As expected, the code below throws a ValueError because of the error-message lines in the file.

#!/usr/bin/python3.5
import pandas as pd  # Version 0.21.0

# 'file' holds the path to the JSON Lines file shown above
df = pd.read_json(file, lines=True)

As this is a very large file, I used chunksize and an exception handler, as below:

max_records = int(1e5)  # chunksize must be an integer
df = pd.read_json(file, lines=True, chunksize=max_records)
filtered_data = pd.DataFrame()  # initialize the result DataFrame
try:
    for df_chunk in df:
        filtered_data = pd.concat([filtered_data, df_chunk])
except ValueError:
    print('\nSome messages in the file cannot be parsed')

But the drawback of this approach is that it misses some of the lines: once a chunk raises ValueError, iteration stops, so every line after the bad chunk (and the good lines inside it) is lost. Is there a better way to do this? I went through the documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html but couldn't find any option to ignore the unparsable lines. Can someone help?
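One possibility is to skip pandas for the parsing step and decode each line with the standard-library json module, dropping the lines that fail. A minimal sketch (untested on the full file; the pure-Python loop may be slow on millions of lines):

import json
import pandas as pd

records = []
with open(file, "r") as inputfile:  # 'file' is the same path as above
    for line in inputfile:
        try:
            records.append(json.loads(line))
        except ValueError:  # json.JSONDecodeError subclasses ValueError
            pass  # skip the log lines and anything else that isn't JSON
df = pd.DataFrame(records)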

Finally, I found a solution to get rid of the error messages in the file. However, this procedure takes extra time to clean the file and saves the result as a new file:

    #!/usr/bin/python3.5

    import re
    import pandas as pd  # Version 0.21.0

    def clean_data(filename):
        """Yield only rows that do not start with '[' (skips the log lines)."""
        with open(filename, "r") as inputfile:
            for row in inputfile:
                if re.match(r"\[", row) is None:
                    yield row

    # 'filename' is the original file, 'clean_file' the path for the cleaned copy
    with open(clean_file, 'w') as outputfile:
        for row in clean_data(filename):
            outputfile.write(row)

    max_records = int(1e5)  # chunksize must be an integer
    df = pd.read_json(clean_file, lines=True, chunksize=max_records)
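
Note that read_json with chunksize returns an iterator of DataFrames, so the chunks still have to be assembled. On the cleaned file, the concat pattern from the question should now run without raising; for example:

    # Assemble all chunks from the cleaned file into one DataFrame
    filtered_data = pd.concat(df, ignore_index=True)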
