
Read large .json file with index format into Pandas dataframe

I was following this answer, but after some discussion with its writer, it seems it only gives a solution for data in the orient='records' format.

This is the difference:

# orient='records'
[
    {"Product":"Desktop Computer","Price":700},
    {"Product":"Tablet","Price":250},
    {"Product":"iPhone","Price":800},
    {"Product":"Laptop","Price":1200}
]

# orient='index'
{
    "0":{"Product":"Desktop Computer","Price":700},
    "1":{"Product":"Tablet","Price":250},
    "2":{"Product":"iPhone","Price":800},
    "3":{"Product":"Laptop","Price":1200}
}

I have the index format because my data comes from an SQL database read into a dataframe, and the index field is needed to identify every record.

My JSON file is 2.5 GB and was exported from the dataframe with the orient='index' format:

df.to_json('test.json', orient='index')

This means that the whole file is actually one huge string and not a list-like collection of records:

{"0":{"Product":"Desktop Computer","Price":700},"1":{"Product":"Tablet","Price":250},"2":{"Product":"iPhone","Price":800},"3":{"Product":"Laptop","Price":1200}}

This means I can't use any line- or chunk-based iterative solution like this:

df = pd.read_json('test.json', orient='index', lines=True, chunksize=5)

According to the documentation, lines=True can only be used if the records are in a list-like format, which is why pandas.DataFrame.to_json does not even accept this argument unless orient='records'. The restriction on chunksize= comes from this as well; the docs say:

"This can only be passed if lines=True. If this is None, the file will be read into memory all at once."

And this is exactly the reason for the question: trying to read such a huge .json file gives back:

df = pd.read_json('test.json', orient='index')

File "C:\Users\Username\AppData\Local\Programs\Python\Python37\lib\site-
packages\pandas\io\json\_json.py", line 1100,
in _parse_no_numpy                                                                                          
loads(json, precise_float=self.precise_float),
MemoryError 

I was thinking about adding the index values as a first column as well; in that case they wouldn't be lost with the records format. Or maybe I could even store an index list separately. I only fear it would decrease the search performance later on.
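A rough sketch of that idea (just an assumption on my part, not something I have benchmarked; reset_index() turns an unnamed index into a column called "index"):

# Keep the index as an ordinary column so nothing is lost in records format:
df.reset_index().to_json('test.jsonl', orient='records', lines=True)

# The line-delimited file can then be read back in manageable chunks:
reader = pd.read_json('test.jsonl', orient='records', lines=True, chunksize=10000)
df = pd.concat(chunk.set_index('index') for chunk in reader)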

Is there any solution to handle this situation strictly using the .json file, without any other database or big-data technology?

Update #1

As requested, here is the actual structure of my data. The SQL table:

          Serial           Date                   PatientID     Type Gender  YearWeek
0         425571118001461E 2011-06-30 20:59:30    186092        3    1.0     2011-w26
1         425571118001461E 2011-06-30 20:55:30    186092        3    1.0     2011-w26
2         425571118001461E 2013-08-28 09:29:30    186092        3    1.0     2013-w35
3         425571118001461E 2013-08-28 07:44:30    186092        3    1.0     2013-w35
4         425571118001461E 2013-08-27 20:44:30    186092        3    1.0     2013-w35
...                    ...                 ...       ...      ...    ...         ...
32290281  4183116300254921 2020-04-09 08:07:50    217553        8    2.0     2020-w15
32290282  4183116300254921 2020-04-08 10:29:50    217553        8    2.0     2020-w15
32290283  4141119420031548 2020-04-20 10:18:02    217555       12    2.0     2020-w17
32290284  4141119420043226 2020-04-20 12:33:11    217560       12    NaN     2020-w17
32290285  4141119420000825 2020-04-20 17:31:44    217568       12    1.0     2020-w17

The pandas pivot table is almost the same as in the example, but with 50,000 rows and 4,000 columns:

df = df.pivot_table(index='PatientID', values='Serial', columns='YearWeek', aggfunc=len, fill_value=0)

YearWeek  1969-w01  1969-w02  1969-w03  1969-w04  1969-w05  ...  2138-w17  2138-w18  2138-w19  2138-w20  2138-w21
PatientID
0                0         0         0         0         0  ...         0         0         0         0         0
455              1         0         3         0         0  ...         0         0         0         0         0
40036            0         0         0         0         0  ...         0         0         0         0         0
40070            0         0         0         0         0  ...         0         0         0         0         0
40082            0         0         0         0         0  ...         0         0         0         0         0
...            ...       ...       ...       ...       ...  ...       ...       ...       ...       ...       ...
217559           0         0         0         0         0  ...         0         0         0         0         0
217560           0         0         0         0         0  ...         0         0         0         0         0
217561           0         0         0         0         0  ...         0         0         0         0         0
217563           0         0         0         0         0  ...         0         0         0         0         0
217568           0         1         0         2         0  ...         0         0         0         0         0

And this is how it is saved as index-formatted JSON:

{
    "0":{"1969-w01":0,"1969-w02":0,"1969-w03":0,"1969-w04":0, ...},
    "455":{"1969-w01":1,"1969-w02":0,"1969-w03":3,"1969-w04":0, ...},
    "40036":{"1969-w01":0,"1969-w02":0,"1969-w03":0,"1969-w04":0, ...},
    ...
    "217568":{"1969-w01":0,"1969-w02":1,"1969-w03":0,"1969-w04":2, ...}
}

However, I could not pass the lines=True argument, so it is actually crammed into one huge string, making it a one-line JSON:

{"0":{"1969-w01":0,"1969-w02":0,"1969-w03":0,"1969-w04":0, ...},"455":{"1969-w01":1,"1969-w02":0,"1969-w03":3,"1969-w04":0, ...},"40036":{"1969-w01":0,"1969-w02":0,"1969-w03":0,"1969-w04":0, ...}, ... "217568":{"1969-w01":0,"1969-w02":1,"1969-w03":0,"1969-w04":2, ...}}

A few solutions, listed from easiest to more involved:

1. SQL

If you can perform the query on the DB, maybe the best solution would be to write the data out in a nicer format. Alternatively, you could read directly from the database - Pandas can do that too :) Here is the documentation for pd.read_sql().
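For example, something along these lines (a minimal sketch; the connection string and table name are placeholders, not taken from the question):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and query - adapt to your own database.
engine = create_engine('postgresql://user:password@host/dbname')
df = pd.read_sql('SELECT * FROM patient_events', engine)

# The pivot from the question can then be built straight from this frame:
df = df.pivot_table(index='PatientID', values='Serial', columns='YearWeek',
                    aggfunc=len, fill_value=0)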

2. Is orient=... necessary?

To read the JSON file in the example format you gave, and create a DataFrame comparable to your pivot-table example (JSON keys as the dataframe index), you can try this simple approach:

# read and transpose!
df = pd.read_json("test.json").T

However, this probably doesn't solve the memory issue.

3. Splitting into several files

Perhaps the fastest way would be to simply cut the large file into smaller files, each of which can be read into a Pandas DataFrame (limiting the working memory required at any given time), and then pd.merge or pd.concat the resulting dataframes.

There is a nice tool on Linux called split which could do this. I notice you are using Windows (newer Windows versions offer a Linux terminal if you enable it); otherwise perhaps there is a similar tool, but I don't know of one, I'm afraid.

If you only need to do this once and then get on with your life, you might be able to open your file with a text editor like Emacs or VS Code, then copy-paste portions into new files... lame, but it might work ¯\_(ツ)_/¯
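Putting the pieces back together could look something like this (a sketch that assumes the big file has already been split into valid, self-contained orient='index' JSON files named part_000.json, part_001.json, ... - those names are made up here):

import glob
import pandas as pd

# Each piece must itself be valid JSON for read_json to accept it.
parts = sorted(glob.glob('part_*.json'))
df = pd.concat(pd.read_json(p, orient='index') for p in parts)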

4. Streaming readers

One package called ijson will iteratively load a JSON file, which allows you to define breaks or process each nested division - you could then, for example, create the records format for Pandas on the fly. This solution also promises low memory consumption, being an iterator (a.k.a. generator) - you will need to learn how it works, though. Have a look here for a nice explanation.
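A rough sketch of what that could look like for your orient='index' layout (this assumes ijson's kvitems() helper, available in recent ijson versions; for truly bounded memory you would flush partial results into smaller DataFrames every N rows instead of keeping one big list):

import ijson
import pandas as pd

index = []
rows = []

# Stream the top-level object one "PatientID": {...} pair at a time,
# instead of parsing the whole 2.5 GB string in one go.
with open('test.json', 'rb') as f:
    for patient_id, row in ijson.kvitems(f, ''):
        index.append(patient_id)
        rows.append(row)

df = pd.DataFrame(rows, index=index)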

Another package called json-streamer can also read partial JSON contents, although it is perhaps going a bit far, given that you have a static file.
