
R or Python for file manipulation

I have 4 reasonably complex R scripts that are used to manipulate CSV and XML files. These were created by another department, where they work exclusively in R.

My understanding is that while R is very fast when dealing with data, it's not really optimised for file manipulation. Can I expect to get significant speed increases by converting these scripts to Python? Or is this something of a waste of time?

I write in both R and Python regularly. I find Python modules for writing, reading and parsing information easier to use, maintain and update. Little niceties, like the way Python lets you work with lists of items rather than R's index-based access, make things much easier to read.

I highly doubt you will gain any significant speed-up by switching languages. If you are becoming the new "maintainer" of these scripts and you find Python easier to understand and extend, then I'd say go for it.

Computer time is cheap ... programmer time is expensive. If you have other things to do, then I'd just limp along with what you've got until you have a free day to putz with them.

Hope that helps.

A few weeks ago, I wrote a Python script to extract some rows from a large (280 MB) CSV file. More precisely, I wanted to extract all available information on companies in DBpedia that have an ISIN field. Later I tried the same in R, but try as I might, the R script took about 10x longer than the Python script (10 min vs 1 min on my rather old laptop). Maybe this is due to my limited knowledge of R, in which case I would appreciate any hint on how to make the script faster. Here is the Python code:

from time import clock

clock()  # time.clock() is the Python 2 process timer; the call at the end reports elapsed time
infile = "infobox_de.csv"
outfile = "companies.csv"

reader = open(infile, "rb")
writer = open(outfile, "w")

# State for the current group of rows (one group per "thing")
oldthing = ""
key = ""
value = ""  # key/value/buf must exist before the first iteration uses them
buf = []
isCompany = False
hasISIN = False
matches = 0

for line in reader:
    row = line.strip().split("\t")
    if len(row) > 0: thing = row[0]
    if len(row) > 1: key = row[1]
    if len(row) > 2: value = row[2]
    # a new "thing" begins: flush the previous group if it qualified
    if (len(row) > 0) and (oldthing != thing):
        if isCompany and hasISIN:
            matches += 1
            for tup in buf:
                writer.write(tup)
        buf = []
        isCompany = False
        hasISIN = False
    isCompany = isCompany or ((key.lower() == "wikipageusestemplate") and (value.lower() == "template:infobox_unternehmen"))
    hasISIN = hasISIN or ((key.lower() == "isin") and (value != ""))
    oldthing = thing
    buf.append(line)

# flush the final group, which the loop above never reaches
if isCompany and hasISIN:
    matches += 1
    for tup in buf:
        writer.write(tup)

reader.close()
writer.close()
print "finished after ", clock(), " seconds; ", matches, " matches."

and here is the R script (I no longer have the exact equivalent, but a very similar version that returns a data frame instead of writing a CSV file, and does not check for ISIN):

infile <- "infobox_de.csv"
maxLines <- 65000  # read in bounded chunks to limit memory use

reader <- file(infile, "r")
writer <- textConnection("queryRes", open = "w", local = TRUE)
writeLines("thing\tkey\tvalue\tetc", writer)  # writeLines appends the newline itself

# State for the current group of rows (one group per "thing")
oldthing <- ""
hasInfobox <- FALSE
lineNumber <- 0
matches <- 0
key <- ""
value <- ""  # key/value/buf must exist before the first iteration uses them
thing <- ""
buf <- c()

repeat {
  lines <- readLines(reader, maxLines)
  if (length(lines) == 0) break
  for (line in lines) {
    lineNumber <- lineNumber + 1
    row <- unlist(strsplit(line, "\t"))
    if (length(row) > 0) thing <- row[1]
    if (length(row) > 1) key <- row[2]
    if (length(row) > 2) value <- row[3]
    # a new "thing" begins: flush the previous group if it qualified
    if ((length(row) > 0) && (oldthing != thing)) {
      if (hasInfobox) {
        matches <- matches + 1
        writeLines(buf, writer)
      }
      buf <- c()
      hasInfobox <- FALSE
    }
    hasInfobox <- hasInfobox || ((tolower(key) == "wikipageusestemplate") && (tolower(value) == "template:infobox_unternehmen"))
    oldthing <- thing
    buf <- c(buf, line)
  }
}
# flush the final group, which the loop above never reaches
if (hasInfobox) {
  matches <- matches + 1
  writeLines(buf, writer)
}
close(reader)
close(writer)

readRes <- textConnection(queryRes, "r")
result <- read.csv(readRes, sep = "\t", stringsAsFactors = FALSE)
close(readRes)
result

What I did explicitly was to restrict readLines to at most 65,000 lines. I did this because I thought my 500 MB RAM machine would otherwise run out of memory. I did not try it without this restriction.

Know where the time is being spent. If your R scripts are bottlenecked on disk I/O (and that is very possible in this case), then you could rewrite them in hand-optimized assembly and be no faster. As always with optimization, if you don't measure first, you're just pissing into the wind. If they're not bottlenecked on disk I/O, you would likely see more benefit from improving the algorithm than from changing the language.
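As a concrete illustration of "measure first": base R can do this with system.time and Rprof. Below is a minimal sketch, assuming the script body is wrapped in a function; extract_companies and the file name are hypothetical stand-ins, not the original scripts.

# extract_companies is a hypothetical stand-in for the real script body
extract_companies <- function(infile) {
  lines <- readLines(infile)
  # ... parsing and filtering would go here ...
  length(lines)
}

# Coarse check: if "elapsed" is much larger than "user" + "system",
# the script is probably waiting on disk I/O rather than burning CPU.
system.time(extract_companies("infobox_de.csv"))

# Finer-grained: sample the call stack while the code runs, then see
# which functions dominate the time.
Rprof("profile.out")
extract_companies("infobox_de.csv")
Rprof(NULL)
summaryRprof("profile.out")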

What do you mean by "file manipulation"? Are you talking about moving files around, deleting, copying, etc.? In that case I would use a shell, e.g. bash. If you're talking about reading in the data, performing calculations, and perhaps writing out a new file, then you could probably use Python or R. Unless maintenance is an issue, I would just leave it as R and find other fish to fry, as you're not going to see enough of a speedup to justify the time and effort of porting that code.

My guess is that you probably won't see much of a speed-up. When comparing high-level languages, the language's own overhead is typically not to blame for performance problems. Typically, the problem is your algorithm.

I'm not very familiar with R, but you may find speed-ups by reading larger chunks of data into memory at once rather than in smaller chunks (fewer system calls). If R doesn't give you the ability to change something like this, you will probably find that Python can be much faster simply because of this ability.
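For reference, base R does expose this knob: readLines takes an n argument, which is exactly what the question's own script uses. A minimal sketch of the two extremes, with a placeholder file name:

# Many small reads: one readLines call per line.
con <- file("infobox_de.csv", "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  # process one line at a time
}
close(con)

# Few large reads: 65000 lines per call, same data overall.
con <- file("infobox_de.csv", "r")
repeat {
  chunk <- readLines(con, n = 65000)
  if (length(chunk) == 0) break
  # process the whole chunk, ideally with vectorized operations
}
close(con)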

There are rules for making R data manipulation fast. The basics are:

  1. vectorize
  2. use data.frames as little as possible (for example, build them only at the end); a sketch of both rules follows this list
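As a hedged illustration of both rules, here is a minimal sketch comparing an element-by-element loop with a single vectorized comparison over plain vectors; the data is invented for the example.

n <- 100000
key <- sample(c("isin", "name", "other"), n, replace = TRUE)
value <- sample(c("", "DE0001234567"), n, replace = TRUE)

# Slow: grow a result vector one element at a time inside a loop.
slow <- function(key, value) {
  hits <- c()
  for (i in seq_along(key)) {
    if (key[i] == "isin" && value[i] != "") hits <- c(hits, i)
  }
  hits
}

# Fast: one vectorized comparison over the whole columns at once.
fast <- function(key, value) {
  which(key == "isin" & value != "")
}

system.time(slow(key, value))
system.time(fast(key, value))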

Search for R time optimization and profiling and you will find many resources to help you.
