Determine file type in R based on the content

On Linux, the file command reports a file's type based on its content (not its extension). Is there a similar function in R?

Old question, but maybe relevant for people getting here via Google: you can use dqmagic, an R wrapper around libmagic, to determine the file type based on the file's content. Since file uses the same library, the results are the same, e.g.:

library(dqmagic)
file_type("DESCRIPTION")
#> [1] "ASCII text"
file_type("src/file.cpp")
#> [1] "C source, ASCII text"

vs.

$ file DESCRIPTION src/file.cpp 
DESCRIPTION:  ASCII text
src/file.cpp: C source, ASCII text

Disclaimer: I am the author of the package.

dqmagic is not on CRAN. Below is an R solution that uses Linux's file command (according to the man page, actually BSD's file v5.35, dated October 2018, as packaged in Ubuntu 19.04):

file_full_path <- "/home/user/Documents/an_RTF_document.doc"
# shQuote() protects paths that contain spaces or shell metacharacters
file_mime_type <- system2(command = "file",
  args = c("-b", "--mime-type", shQuote(file_full_path)),
  stdout = TRUE) # "text/rtf"
# Gives the list of potentially allowed extensions for this mime type:
file_possible_ext <- system2(command = "file",
  args = c("-b", "--extension", shQuote(file_full_path)),
  stdout = TRUE) # "???". "doc/dot" for MsWord files.
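
The call above assumes the file utility is installed and the path exists. A minimal sketch of a more defensive wrapper follows; detect_mime_type is a hypothetical helper name, not part of the original answer, and it simply returns NA when the command or the file is missing:

```r
# Hypothetical helper (an assumption, not from the original answer):
# returns the MIME type reported by the external 'file' utility,
# or NA if the utility or the file itself is not available.
detect_mime_type <- function(path) {
  # Sys.which() returns "" when the command is not found on the PATH
  if (!nzchar(Sys.which("file")) || !file.exists(path)) {
    return(NA_character_)
  }
  out <- system2("file",
                 args = c("-b", "--mime-type", shQuote(path)),
                 stdout = TRUE)
  # Keep only the first output line; NA if nothing was printed
  if (length(out) == 0) NA_character_ else out[1]
}

# Usage: write a small text file and ask for its MIME type
tmp <- tempfile(fileext = ".txt")
writeLines("hello", tmp)
detect_mime_type(tmp)
```

Returning NA instead of raising an error lets the caller decide how to handle systems without the file command, such as a default Windows installation.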

It may be necessary to check that the actual extension is known to be a valid extension for the given mime type (for instance, readtext::readtext() reads an RTF file but fails if it is saved as *.doc).

# Extension of the actual file name, lower-cased (doc, rtf, docx, odt...)
file_ext <- tolower(tools::file_ext(file_full_path))
# In some (all?) cases, for an rtf mime-type,
# 'file' outputs "???" as allowed extension
if (file_mime_type == "text/rtf") {
  file_possible_ext <- "rtf"
}

# 'file --extension' separates alternatives with "/", e.g. "doc/dot"
allowed_ext <- strsplit(tolower(file_possible_ext), "/", fixed = TRUE)[[1]]
# Returns TRUE if the actual extension is known to
# be a valid extension for the given mime type.
# An exact match is used so that "doc" does not match "docx".
file_ext %in% allowed_ext
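
The same check can be wrapped in a reusable function. The sketch below is only an illustration: ext_matches_mime and fallback_ext are hypothetical names, and the fallback table (which generalizes the RTF special case to other mime types for which file might print "???") is an assumption to be adapted as needed.

```r
# Hypothetical fallback table (an assumption, extend as needed): extensions
# to use when 'file --extension' prints "???" for a MIME type.
fallback_ext <- c("text/rtf" = "rtf", "application/pdf" = "pdf")

# Hypothetical helper: TRUE if the extension of 'path' is allowed for the
# MIME type, given the slash-separated list printed by 'file --extension'.
ext_matches_mime <- function(path, mime_type, possible_ext,
                             fallback = fallback_ext) {
  # Substitute the fallback only when 'file' gave no usable answer
  if (identical(possible_ext, "???") && mime_type %in% names(fallback)) {
    possible_ext <- fallback[[mime_type]]
  }
  allowed <- strsplit(tolower(possible_ext), "/", fixed = TRUE)[[1]]
  actual <- tolower(tools::file_ext(path))
  # A file without any extension never matches
  nzchar(actual) && actual %in% allowed
}

ext_matches_mime("an_RTF_document.doc", "text/rtf", "???")       # FALSE
ext_matches_mime("an_RTF_document.rtf", "text/rtf", "???")       # TRUE
ext_matches_mime("letter.DOC", "application/msword", "doc/dot")  # TRUE
```

Because the function takes the mime type and extension list as plain strings, it can be tested without calling the file command at all.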
