
Extract substring from list of file names in Python or R

My question is very similar to this one: "How to get a Substring from list of file names". I'm new to Python and would prefer a similar solution for Python (or R). I'd like to look in a directory, extract a particular substring from each applicable file name, and output the results as a vector (preferred), list, or array. For example, assume I have a directory with the following file names:

data_ABC_48P.txt
data_DEF_48P.txt
data_GHI_48P.txt
other_96.txt
another_98.txt

I would like to reference the directory and extract the following as a character vector (for use in R) or list:

"ABC", "DEF", "GHI"

I tried the following:

from os import listdir
from os.path import isfile, join
files = [ f for f in listdir(path) if isfile(join(path,f)) ]
import re
m = re.search('data_(.+?)_48P', files)

But I get the following error:

TypeError: expected string or buffer

files is of type list

In [10]: type(files)
Out[10]: list

Even though I ultimately want this character vector as an input to R code, we are trying to move all of our "scripting" to Python and use R solely for data analysis, so a Python solution would be great. I'm also using Ubuntu, so a command-line or bash script solution could work as well. Thanks in advance!

Use a list comprehension, like this:

[re.search(r'data_(.+?)_48P', i).group(1) for i in files if re.search(r'data_.+?_48P', i)]

You need to iterate over the list contents in order to grab the substrings you want.
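The comprehension above runs the regex twice for each file name. Here is a minimal sketch that searches only once per name. `PATTERN` and `extract_tags` are names made up for this illustration, and the file list is hard-coded from the question rather than read from a directory:

```python
import re

# Compile once; the capture group holds the text between "data_" and "_48P".
PATTERN = re.compile(r'data_(.+?)_48P')

def extract_tags(names):
    """Return the captured substring for every name that matches PATTERN."""
    # Search each name once, then keep only the successful matches.
    matches = (PATTERN.search(name) for name in names)
    return [m.group(1) for m in matches if m]

files = ['data_ABC_48P.txt', 'data_DEF_48P.txt', 'data_GHI_48P.txt',
         'other_96.txt', 'another_98.txt']
print(extract_tags(files))  # ['ABC', 'DEF', 'GHI']
```

Names that do not match, such as other_96.txt, are skipped because `re.search` returns None for them.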

re.search requires a string, not a list.

Use

import re

m = []
for line in files:
    # re.search returns None for names that do not match, so check first
    match = re.search(r'data_(.+?)_48P', line)
    if match:
        m.append(match.group(1))

re.search() doesn't accept a list as an argument; you need to loop over the list and pass each element (which must be a string) to the function. You can use lookarounds to match only the part you want. Since re.search returns a match object (or None when nothing matches), call group() on it to get the matched string:

>>> for i in files:
...     try:
...         print(re.search(r'(?<=data_).*(?=_48P)', i).group(0))
...     except AttributeError:
...         pass
...
ABC
DEF
GHI

from os import listdir
from os.path import isfile, join
import re
strings = []
for f in listdir(path):
    if isfile(join(path,f)):
        m = re.search('data_(.+?)_48P', f)
        if m:
            strings.append(m.group(1))

print(strings)

Output:

['ABC', 'DEF', 'GHI']
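The same loop can also be wrapped in a function so the directory is passed in explicitly. The following is only a sketch, assuming Python 3 (it uses `pathlib`); `tags_in_directory` is a hypothetical helper name, and the entries are sorted so the result order does not depend on the file system:

```python
import re
from pathlib import Path

def tags_in_directory(directory):
    """Collect the text between 'data_' and '_48P' from regular files in directory."""
    pattern = re.compile(r'data_(.+?)_48P')
    tags = []
    # Sort the entries so the output order is deterministic.
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_file():  # skip subdirectories
            m = pattern.search(entry.name)
            if m:
                tags.append(m.group(1))
    return tags
```

Calling `tags_in_directory(path)` on the directory from the question should then give the same list as above.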

In R:

list.files('~/desktop/test')
# [1] "another_98.txt"   "data_ABC_48P.txt" "data_DEF_48P.txt" "data_GHI_48P.txt" "other_96.txt"

gsub('_', '', unlist(regmatches(l <- list.files('~/desktop/test'),
                                gregexpr('_(\\w+?)_', l, perl = TRUE))))
# [1] "ABC" "DEF" "GHI"

Another way:

l <- list.files('~/desktop/test', pattern = '_(\\w+?)_')

sapply(strsplit(l, '[_]'), '[[', 2)
# [1] "ABC" "DEF" "GHI"
