
Reading a pickle file (PANDAS Python Data Frame) in R

Is there an easy way to read pickle files (.pkl) containing a pandas DataFrame into R?

One possibility is to export to CSV and have R read the CSV, but that seems cumbersome because my dataframes are rather large. Is there an easier way?

Thanks!

Using reticulate, as suggested by russellpierce in the comments, was easy and worked smoothly.

install.packages('reticulate')

I then created a Python script like this, based on the examples in the reticulate documentation.

Python file (pickle_reader.py):

import pandas as pd

def read_pickle_file(file):
    pickle_data = pd.read_pickle(file)
    return pickle_data

And then my R file looked like:

library(reticulate)

# Make the functions defined in pickle_reader.py callable from R
source_python("pickle_reader.py")
pickle_data <- read_pickle_file("C:/tsa/dataset.pickle")

This loaded all the data I had previously stored in pickle format into R.
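
Note that pd.read_pickle returns whatever object was pickled, which is not necessarily a DataFrame. As a minimal sketch, assuming you would rather fail early in Python than receive an unexpected object in R, the helper could check the type. The name read_pickle_frame is hypothetical and not part of pandas or reticulate:

```python
import pandas as pd


def read_pickle_frame(path):
    """Load a pickle and insist that it holds a pandas DataFrame."""
    obj = pd.read_pickle(path)
    if not isinstance(obj, pd.DataFrame):
        # Anything else (dict, list, Series, ...) is rejected here
        raise TypeError(
            f"expected a pandas DataFrame in {path!r}, got {type(obj).__name__}"
        )
    return obj
```

It can be sourced from R with source_python() in the same way as read_pickle_file.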

You can also do all of this inline in R without leaving your R editor (provided your system Python can reach pandas), e.g.:

library(reticulate)
pd <- import("pandas")
pickle_data <- pd$read_pickle("dataset.pickle")

Edit: If you can install and use the {reticulate} package, then this answer is probably outdated. See the other answers below for an easier path.

You could load the pickle in Python and then export it to R via the Python package rpy2 (or similar). Once you've done so, your data will exist in an R session linked to Python. I suspect that what you'd want to do next is use that session to call R's saveRDS and write to a file or RAM disk. Then in RStudio you can read that file back in. Look at the R packages rJython and rPython for ways to trigger the Python commands from R.

Alternatively, you could write a simple Python script to load your data in Python (probably using one of the R packages noted above) and write a formatted data stream to stdout. Then the entire system call to the script (including the argument that specifies your pickle) can be used as an argument to fread in the R package data.table. Alternatively, if you wanted to keep to standard functions, you could use a combination of system(..., intern=TRUE) and read.table.
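
A minimal sketch of such a script, assuming CSV as the "formatted data stream" and a pickle that holds a DataFrame. The function name pickle_to_csv_stream and the file name pickle_to_csv.py are hypothetical:

```python
import sys

import pandas as pd


def pickle_to_csv_stream(pickle_path, out=sys.stdout):
    """Load a pickled DataFrame and write it as CSV text to `out`."""
    df = pd.read_pickle(pickle_path)
    # index=False keeps the pandas index out of the output; drop this
    # argument if the index carries data you need in R.
    df.to_csv(out, index=False)


# Only runs when exactly one pickle path is given on the command line,
# e.g.: python pickle_to_csv.py dataset.pkl
if __name__ == "__main__" and len(sys.argv) == 2:
    pickle_to_csv_stream(sys.argv[1])
```

On the R side, something like fread(cmd = "python pickle_to_csv.py dataset.pkl") should then read the stream directly, assuming the python on your PATH is the one that has pandas installed.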

As usual, there are /many/ ways to skin this particular cat. The basic steps are:

  1. Load the data in Python
  2. Express the data to R (e.g., exporting the object via rpy2 or writing formatted text to stdout with R ready to receive it on the other end)
  3. Serialize the expressed data in R to an internal data representation (e.g., exporting the object via rpy2 or fread)
  4. (optional) Make the data in that session of R accessible to another R session (i.e., the step to close the loop with rpy2; if you've been using fread then you're already done).

To add to the answer above: you might need to point to a different conda env to get to pandas:

use_condaenv("name_of_conda_env", conda = "<<result_of `which conda`>>")
pd <- import('pandas')

df <- pd$read_pickle(paste0(outdir, "df.pkl"))
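
If import('pandas') still fails, it can help to check which interpreter is actually running and whether it can see pandas. A small sketch using only the Python standard library; describe_python is a hypothetical helper name, and it could be run directly with a given interpreter or loaded from R via source_python():

```python
import importlib.util
import sys


def describe_python(module_name="pandas"):
    """Report which interpreter is running and whether a module is importable."""
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,
        "module": module_name,
        "available": spec is not None,
        # origin is the file the module would be loaded from, if any
        "origin": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    print(describe_python())
```

If "available" is False, the interpreter in use belongs to an environment without pandas, which is the situation use_condaenv() is meant to address.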


 