
Modifying LexisNexisTools in R (RStudio) to rename files

I'm trying to rename .txt files in a directory downloaded from Nexis Advance UK. Being unfamiliar with coding, I set about trying to modify LexisNexisTools' code in RStudio.

What I've done is change term.v <- content_v[grep("^Terms: |^Begriffe: ", content_v)] to term.v <- content_v[grep("The Guardian(London)", fixed = T, content_v)], for instance, and changed the rename function so that it only pastes term.v. However, I'm trying to retain the original OR function so that the code cycles through a number of strings such as "Express Online" or "The Independent (United Kingdom)" and then pastes the string it finds into the file rename function.

Here is what I've tried so far:

1 - Use regular expressions (from what I could gather online about regular expressions with spaces in strings) with fixed = F, such as "^The/sGuardian(London)$|^Express/sOnline$"

2 - Use a vector to "house" the different patterns and then paste the vector into the grep command:

toMatch.v <- c("Express Online", "The Times (London)", "The Independent (United Kingdom)")

term.v <- content_v[grep(paste(toMatch.v, collapse = "|"), content_v)]

The only time the code (as modified) works is when fixed = T and the string is typed exactly as it appears in the .txt files.

What am I doing wrong? Thank you so much, and I apologize if the terminology isn't accurate.

Extra details:

Originally, the code relies on a set of keywords to find the search term and insert it into the file's name:

    content_v <- readLines(files[i], encoding = encoding, n = 50)
    term.v <- content_v[grep("^Terms: |^Begriffe: ", content_v)]
    # erase everything in the line except the actual range
    term.v <- gsub("^Terms: |^Begriffe: ", "", term.v)
    # split term into elements separated by AND, and, or OR
    term.v <- unlist(strsplit(term.v, split = " AND | and | OR ", fixed = FALSE))

I have changed it so that grep looks for the string that I want to append to the filename, as explained above. I have also disabled the gsub line and changed the split argument to "/n", as the strings in my text files are separated by line breaks. Here is an example of a sample .txt file.

Assuming that you have a file file1.txt with something like the following content in your working directory:

foo
foo bar Express Online
bar

Then the following code should rename the file to Express Online.txt.

file1 <- "file1.txt"

# pass the path directly so that no unclosed connection is left behind
text1 <- readLines(file1)

# if (any(grepl("The Guardian", text1))) {
#     file.rename(file1, "The Guardian.txt")
# } else if (any(grepl("Express Online", text1))) {
#     file.rename(file1, "Express Online.txt")
# }

newname <- head(
    n = 1,
    na.omit(
        stringr::str_extract(
            text1,
            "(Express Online)|(The Times \\(London\\))|(The Independent \\(United Kingdom\\))")))

file.rename(file1, paste0(newname, ".txt"))
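The pattern above escapes the parentheses by hand, because ( and ) are regex metacharacters and would otherwise be read as grouping rather than as literal characters. In an R string the whitespace class is also written "\\s", not "/s". If you would rather not escape anything, a minimal sketch is to test each name literally with fixed = TRUE, which is the setting that already worked for a single string. The helper first_match() and the demo vectors are hypothetical names used for illustration only, and the helper returns the first hit in the order of the candidates, not in the order of the lines.

```r
# Hypothetical helper: return the first candidate that occurs literally
# (fixed = TRUE, so parentheses need no escaping) anywhere in `lines`.
first_match <- function(lines, candidates) {
  for (cand in candidates) {
    if (any(grepl(cand, lines, fixed = TRUE))) {
      return(cand)
    }
  }
  NA_character_  # nothing found
}

# Demo data mirroring file1.txt from above (assumed content).
lines_demo <- c("foo", "foo bar Express Online", "bar")
publications <- c("Express Online", "The Times (London)",
                  "The Independent (United Kingdom)")

newname_demo <- first_match(lines_demo, publications)
print(newname_demo)
```

Check the result for NA before passing it to file.rename(), since pasting NA into a filename would otherwise produce "NA.txt".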

Unfortunately, your file format is quite different from how the files looked when I wrote LexisNexisTools, and so are your requirements. So I would write new code to do the job here. First, let's try it for one file:

f <- "/home/johannes/Documents/x.txt"
lines <- readLines(f)

toMatch.v <- c("Express Online", "The Times (London)", "The Independent (United Kingdom)")

# I'm using another function from the package to conveniently look up several patterns at once
np <- unlist(LexisNexisTools::lnt_lookup(lines, toMatch.v, verbose = FALSE))[1]
new_name <- paste0(dirname(f), "/", np, ".txt")
new_name
#> [1] "/home/johannes/Documents/Express Online.txt"

file.rename(f, new_name)

Once this works for you as intended, you can apply it to a number of files. As in my original function, I would suggest you write the new names into a data.frame first, so you can check whether the new names make sense and whether they contain duplicates (in that case, R would rename both files to the same name without warning and destroy one of them):

files <- list.files("/home/johannes/Documents/", pattern = "\\.txt$",
                    ignore.case = TRUE, full.names = TRUE)

make_new_name <- function(old_name) {
  lines <- readLines(old_name)

  np <- unlist(LexisNexisTools::lnt_lookup(lines, toMatch.v, verbose = FALSE))[1]
  paste0(dirname(old_name), "/", np, ".txt")
}  

df <- tibble::tibble(
  old = files,
  new = sapply(files, make_new_name)
)
df               
#> # A tibble: 2 x 2
#>   old                              new                                        
#>   <chr>                            <chr>                                      
#> 1 /home/johannes/Documents//x.txt  /home/johannes/Documents/Express Online.txt
#> 2 /home/johannes/Documents//x2.txt /home/johannes/Documents/Express Online.txt

If the new names make sense to you and there are no duplicates (check with table(duplicated(df$new))), you can pull the trigger and rename the files:

file.rename(df$old, df$new)
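If you prefer a guard over checking by eye, the following is a minimal sketch of a wrapper that skips risky rows instead of renaming them. The function safe_rename() and the demo paths are hypothetical and not part of LexisNexisTools. It skips targets that are duplicated, already exist, or end in NA.txt, which is what the paste0() call above would produce when no pattern is found.

```r
# Hypothetical guard around file.rename(): only rename rows whose target is
# unique, does not exist yet, and was not built from a missing (NA) match.
safe_rename <- function(old, new) {
  ok <- !is.na(new) &
    !grepl("/NA\\.txt$", new) &
    !(duplicated(new) | duplicated(new, fromLast = TRUE)) &
    !file.exists(new)
  renamed <- rep(FALSE, length(old))
  renamed[ok] <- file.rename(old[ok], new[ok])
  renamed
}

# Demo in a fresh temporary directory (assumed file names).
demo_dir <- tempfile("rename_demo")
dir.create(demo_dir)
old <- file.path(demo_dir, c("x.txt", "x2.txt", "x3.txt"))
file.create(old)
new <- file.path(demo_dir, c("Express Online.txt", "Express Online.txt",
                             "The Times (London).txt"))

res <- safe_rename(old, new)
print(data.frame(old = basename(old), new = basename(new), renamed = res))
```

In this demo the first two rows share a target and are left alone, so only the third file should be renamed. With the table from above, the call would be safe_rename(df$old, df$new).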
