I have a CSV file that has ~1.9M rows and 32 columns. I also have limited RAM, which makes loading it into memory very inconvenient. As a result I am thinking of using a database, but I do not have any intimate knowledge of the subject, and I have looked around this site but found no viable solutions so far.
The CSV file looks like this:
Case,Event,P01,P02,P03,P04,P05,P06,P07,P08,P09,P10,P11,P12,P13,P14,P15,P16,P17,P18,P19,P20,P21,P22,P23,P24,P25,P26,P27,P28,P29,P30
C000039,E97553,8,10,90,-0.34176313227395744,-5.581162038780728E-4,-0.12090388100201072,-1.5172412910939355,-0.9075283173030568,2.0571877671625742,-0.002902632819930783,-0.6761896565590585,-0.7258602353522214,0.8684602429202587,0.0023189312896576167,0.002318939470525324,-0.1881462494296103,-0.0014303471592995315,-0.03133299206977217,7.72338072867324E-4,-0.08952068388668191,-1.4536398437657685,-0.020065144945600275,-0.16276139919188118,0.6915962670997067,-1.593412697264055,-1.563877781707804,-1.4921751129092755,4.701551108078644,6,-0.688302560842075
C000039,E23039,8,10,90,-0.3420173545012358,-5.581162038780728E-4,-1.6563770995734233,-1.5386562526752448,-1.3604342580422861,2.1025445031625525,-0.0028504751366762804,-0.6103972392687121,-2.0390388918403284,-1.7249948885013526,0.00231891181914203,0.0023189141684282384,-0.18603688853814693,-0.0014303471592995315,-0.03182759137355937,0.001011754948131039,0.13009444290656555,-1.737249614361576,-0.015763602969926262,-0.16276139919188118,0.7133868949811379,-1.624962995908364,-1.5946762525901037,-1.5362787555380522,4.751479927607516,6,-0.688302560842075
C000039,E23039,35,10,90,-0.3593468363273839,-5.581162038780728E-4,-2.2590624066428937,-1.540784192984501,-1.3651511418164592,0.05539868728273849,-0.00225912499740972,0.20899232681704485,-2.2007336302050633,-2.518401278903022,0.0023189850665203673,0.0023189834133465186,-0.1386548782028836,-0.0013092574968056093,-0.0315006293688149,9.042390365542781E-4,-0.3514180333671346,-1.8007561969675518,-0.008593259125791147,-2.295351187387221,0.6329101442826701,-1.8095530459660578,-1.7748676145152822,-1.495347406256394,2.553693742122162,34,-0.6882806822066699
... (up to 1.9M rows)
As you can see, the 'Case' column repeats itself, but I want to get only unique records before importing it into a dataframe. So I used this:
f<-file("test.csv")
bigdf <- sqldf("select * from 'f' where Case in (select Case from 'f' group by Case having count(*) = 1)", dbname = tempfile(), file.format = list(header = T, row.names = F))
However I get this error:
Error in sqliteExecStatement(con, statement, bind.data) :
RS-DBI driver: (error in statement: near "in": syntax error)
Is there something obvious I am missing here? Many thanks in advance.
CASE is a keyword, so you must quote this column name as "Case" in the query.
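A minimal sketch of the corrected query, untested, assuming the same test.csv and sqldf setup as in the question. SQLite accepts "Case", `Case`, or [Case] as quoted identifiers; the brackets are used here to avoid clashing with the R string quotes:

```r
library(sqldf)

f <- file("test.csv")
# Quote the column so SQLite treats [Case] as an identifier,
# not the start of a CASE expression:
bigdf <- sqldf("select * from f where [Case] in
                  (select [Case] from f group by [Case] having count(*) = 1)",
               dbname = tempfile(),
               file.format = list(header = TRUE, row.names = FALSE))
```

Because dbname = tempfile() is given, sqldf loads the file into an on-disk SQLite database rather than into RAM, which is what makes this workable for ~1.9M rows.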
For those who want unique rows using sqldf, use DISTINCT:

newdf <- sqldf("SELECT DISTINCT * FROM df") # Get unique rows

sqldf uses SQLite syntax by default.

newdf <- sqldf("SELECT DISTINCT name FROM df") # Get unique column values
newdf <- sqldf("SELECT *, COUNT(DISTINCT name) AS NoNames FROM df GROUP BY whatever") # Get a count of unique names
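A self-contained toy example of the DISTINCT behaviour above, using a small made-up data frame (the df, Case, and Event names are illustrative, not from the original data):

```r
library(sqldf)

# Toy data: one duplicated row
df <- data.frame(Case  = c("C000039", "C000039", "C000040"),
                 Event = c("E97553",  "E97553",  "E23039"),
                 stringsAsFactors = FALSE)

# DISTINCT * drops exact duplicate rows (3 rows -> 2 rows)
sqldf("SELECT DISTINCT * FROM df")

# COUNT(DISTINCT ...) counts unique values of one column
sqldf('SELECT COUNT(DISTINCT "Case") AS n FROM df')
```

Note that DISTINCT only removes exact duplicate rows; to keep only Cases that occur exactly once, as in the question, you still need the GROUP BY ... HAVING COUNT(*) = 1 filter.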
If you use "Case" in sqldf in R, you should put a "," before "Case"; otherwise the parser reads "Case" as part of the preceding expression, so you need to keep it separate.