
rpy2 R->Python data.frame: bad NA and custom converter

At the moment I'm getting my feet wet with the rpy2 package (it's rather cool).

But I'm running into an issue similar to the one discussed in this question, where the type conversion between an R data.frame and a Python pandas DataFrame is rather messy:

import rpy2.robjects as ro

from rpy2.robjects.conversion import localconverter
from rpy2.robjects import pandas2ri

# float -> NaN
# int -> -2147483648
# Strings -> None
# Bool -> int

ro.r(
    """
        f <- function() {
              return(data.frame(int = c(4L, NA),
                     float = c(1.2, NA),
                     chr = c("A", NA),
                     bool = c(TRUE, FALSE)))
        }
        f()
    """
)

r_f = ro.globalenv["f"]
res = r_f()

with localconverter(ro.default_converter + pandas2ri.converter):
    pd_from_r_df = ro.conversion.rpy2py(res)

print(pd_from_r_df)

results in:

python -u "/workspace/SLW/Python_examples/MWE_2.py"
          int  float   chr  bool
1           4    1.2     A     1
2 -2147483648    NaN  None     0

As you can see, the integer NA gets blown up into a huge negative number and the boolean column is turned into integers. Since this feels like such an ordinary case, I'm sure a lot of people have already run into it and maybe designed their own custom converter, as advised here. Does anyone have the code for such a converter? That would give me something to start from, since at the moment I have no clue how to go about writing my own. Maybe in the long run such a converter could become part of the rpy2 package.

Here is a simple, completely custom converter from R to pandas DataFrame:

from rpy2.robjects.conversion import localconverter, get_conversion
from rpy2 import rinterface as ri
import rpy2.robjects as ro
from rpy2.rinterface_lib import na_values

import pandas as pd

# create your own rules for df columns
df_rules = ro.default_converter

@df_rules.rpy2py.register(ri.IntSexpVector)
def to_int(obj):
    return [int(v) if v != na_values.NA_Integer else pd.NA for v in obj]


@df_rules.rpy2py.register(ri.FloatSexpVector)
def to_float(obj):
    return [float(v) if v != na_values.NA_Real else pd.NA for v in obj]


@df_rules.rpy2py.register(ri.StrSexpVector)
def to_str(obj):
    return [str(v) if v != na_values.NA_Character else pd.NA for v in obj]


@df_rules.rpy2py.register(ri.BoolSexpVector)
def to_bool(obj):
    return [bool(v) if v != na_values.NA_Logical else pd.NA for v in obj]

# define the top-level converter
def toDataFrame(obj):
    cv = get_conversion() # get the converter from current context
    return pd.DataFrame(
        {str(k): cv.rpy2py(obj[i]) for i, k in enumerate(obj.names)}
    )

# associate the converter with R data.frame class
df_rules.rpy2py_nc_map[ri.ListSexpVector].update({"data.frame": toDataFrame})


# code in OP
ro.r(
    """
        f <- function() {
              return(data.frame(int = c(4L, NA),
                     float = c(1.2, NA),
                     chr = c("A", NA),
                     bool = c(TRUE, FALSE)))
        }
        f()
    """
)

r_f = ro.globalenv["f"]
res = r_f()

with localconverter(df_rules): # use the defined rules here
    pd_from_r_df = r_f()  # call the R function inside the context so the custom rules are applied

print(pd_from_r_df)

This prints

    int  float   chr   bool
0     4    1.2     A   True
1  <NA>    NaN  <NA>  False

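So it's not perfect: R's float and bool NA values are not captured by the custom converter (at least not in the way I've implemented it here), but int and str work as intended. Note that the bool column in this example contains no NA, so only the float case is visible in the output above.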
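
A likely reason the float column slips through: at the floating-point level R's NA_real_ is a NaN, and a NaN never compares equal to anything, including itself, so a test like `v != na_values.NA_Real` is true for every element. The sketch below shows the effect with a plain Python NaN standing in for NA_Real (an assumption; it does not import rpy2), and an isnan-based check as one possible alternative. Be aware that isnan cannot tell R's NA apart from an ordinary NaN.

```python
import math

na_real = float("nan")  # stand-in for na_values.NA_Real (assumption, not the rpy2 object)
column = [1.2, na_real]

# Inequality test, as in to_float above: NaN != NaN is always True,
# so the missing value is never replaced.
naive = [v if v != na_real else None for v in column]

# isnan-based test: replaces the missing value (and any ordinary NaN as well).
checked = [None if math.isnan(v) else v for v in column]

print(naive)
print(checked)
```

In the converter itself, None would be replaced by whichever missing marker you prefer, for example pd.NA.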

Also, it may not be the fastest solution: I believe the pandas2ri module uses numpy's buffer copy mechanism, but at the moment I don't know how to handle the NAs in that kind of conversion.
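
For the bulk path, one possible workaround is to let the fast conversion run and repair the integer column afterwards. The sketch below assumes the missing integers arrive as R's NA_integer_ sentinel, -2147483648 (the smallest 32-bit integer, as seen in the question's output). `mask_int_na` is a hypothetical helper written for illustration, not part of rpy2 or pandas, and None stands in for whatever missing marker you prefer.

```python
R_NA_INTEGER = -2**31  # R reserves the smallest 32-bit int for NA_integer_


def mask_int_na(column, missing=None):
    """Replace R's integer NA sentinel with an explicit missing marker."""
    return [missing if v == R_NA_INTEGER else v for v in column]


# The 'int' column as it appears in the question's output.
int_column = [4, -2147483648]
cleaned = mask_int_na(int_column)
print(cleaned)
```

Because R reserves that value for NA, a genuine R integer can never collide with the sentinel, so the replacement is safe for columns that really are R integers.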
