
Special characters when importing from BigQuery to R

I have a script for scraping some tweets and saving the results to Google BigQuery. When I view the stored data, special characters like ➕, ‍♂️, Ñ, áéíóú appear correctly, but when I try to import the data back into R they are replaced by strange characters. Here's an example.

# Create df

id_tweet <- 1023985670224785408
tweet <- "◉ Neuroeducación y entornos digitales de aprendizaje: un paso obligado para educadores, pedagogos y psicólogos"
descripcion <- "Desde las alturas se ve todo de otra manera... ️ ➕ ‍♂️"

data <- data.frame(id_tweet, tweet, descripcion, stringsAsFactors = FALSE)

# Save to Google BQ

library(bigrquery)

insert_upload_job("project-id", "dataset", "table", data, write_disposition = "WRITE_APPEND")

# Load from Google BQ

sql <- paste("SELECT *", "FROM", "`project-id.dataset.table`")
data <- query_exec(sql, project = "project-id", use_legacy_sql = FALSE)

My output is the following:

> data
               id_tweet
283 1023985670224785408
                                                                                                                                         tweet
283 ◉ Neuroeducación y entornos digitales de aprendizaje: un paso obligado para educadores, pedagogos y psicólogos
                                                                                        descripcion
283 Desde las alturas se ve todo de otra manera... ï¿½ï¿½ï¸ âž• ��<U+200D>â™‚ï¸ ï¿½ï¿½ ��

What I want is to keep the original format.

What should I do?

Thanks,

I tested a few things which may help.

First, I saved a blank R script and made sure it was in UTF-8 encoding: File -> Save with Encoding -> UTF-8. Then I saved just the special characters from your question, in double quotes, as a .csv (i.e. "➕, ‍♂️, Ñ, áéíóú"). Then I read the csv back in with fileEncoding = "UTF-8", i.e.:

test <- read.csv("test.csv", fileEncoding = "UTF-8", header=FALSE, stringsAsFactors = FALSE)

Inside RStudio, test returns:

# > test
# V1
# 1 \u2795, ‍♂️, Ñ, áéíóú

So everything except the ➕ displays nicely in RStudio. However, many characters, even common ones like line breaks and tabs, display oddly in RStudio but look normal once written to a file. These are no different.

When the csv is written back out (just using write.csv(test, 'test2.csv', row.names=FALSE)), it displays perfectly, exactly as in the original csv (when opened in Sublime Text).

After all this, I would suggest ensuring your encoding is UTF-8, and perhaps trying to save the BQ output as a csv (if possible) and inspecting it to see whether the issue is coming from BQ or from R. If it comes out of BQ correctly, then it should simply be a matter of changing the encoding in RStudio. But if it's not coming out of BQ as intended, then I'd suggest you need to change the data type in BQ (to UTF-8).
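One quick way to do that check on the R side is to look at how the returned strings are marked and, if the bytes are really UTF-8 but mislabelled, re-declare them. This is only a sketch using base R's Encoding() and write.csv(); the column name descripcion is taken from the example above:

# Inspect how R has marked the strings returned from BigQuery
Encoding(data$descripcion)        # e.g. "unknown" or "latin1"

# If the bytes are actually UTF-8 but were labelled with the native
# encoding, re-declaring them as UTF-8 often restores the characters
Encoding(data$descripcion) <- "UTF-8"

# Write the result to disk and open it in a text editor (e.g. Sublime Text)
# to see whether the problem came from BQ or from the import step
write.csv(data, "bq_output.csv", row.names = FALSE, fileEncoding = "UTF-8")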

After 6 months, I finally managed to solve this problem. Instead of using the function query_exec, I used bq_table_download from the same package. This function solves the problem.
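For reference, a minimal sketch of that workflow with bigrquery's newer API (bq_project_query() to run the query, then bq_table_download() to fetch the result), using the same placeholder project, dataset and table names as in the question:

library(bigrquery)

sql <- "SELECT * FROM `project-id.dataset.table`"

# Run the query; this returns a reference to the result table ...
tb <- bq_project_query("project-id", sql)

# ... then download it into a data frame; the special characters
# come through intact, unlike with query_exec
data <- bq_table_download(tb)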
