Extract substring from list of file names in Python or R

Question

My question is very similar to the following: How to get a Substring from list of file names. I'm a newb to Python and would prefer a similar solution for Python (or R). I'd like to look into a directory, extract a particular substring from each applicable file name, and output the results as a vector (preferred), list, or array. For example, assume I have a directory with the following file names:

data_ABC_48P.txt
data_DEF_48P.txt
data_GHI_48P.txt
other_96.txt
another_98.txt

I would like to reference the directory and extract the following as a character vector (for use in R) or list:

"ABC", "DEF", "GHI"

I tried the following:

from os import listdir
from os.path import isfile, join
files = [ f for f in listdir(path) if isfile(join(path,f)) ]
import re
m = re.search('data_(.+?)_48P', files)

But I get the following error:

TypeError: expected string or buffer

files is of type list

In [10]: type(files)
Out[10]: list

Even though I ultimately want this character vector as an input to R code, we are trying to transition all of our "scripting" to Python and use R solely for data analysis, so a Python solution would be great. I'm also using Ubuntu, so a cmd line or bash script solution could work as well. Thanks in advance!

Answer 1

Use a list comprehension, like this:

[re.search(r'data_(.+?)_48P', i).group(1) for i in files if re.search(r'data_.+?_48P', i)]

You need to iterate over the list contents in order to grab the substrings you want.
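As a minimal sketch of that iteration, assuming the file names from the question are already in a list (the sample list and the names `pattern`, `matches`, and `substrings` are illustrative, not part of the original answer), the pattern can be compiled once and each name searched only once:

```python
import re

# Sample names taken from the question; in practice this would come from os.listdir().
files = [
    "data_ABC_48P.txt",
    "data_DEF_48P.txt",
    "data_GHI_48P.txt",
    "other_96.txt",
    "another_98.txt",
]

# Compile once; the lazy group captures the text between "data_" and "_48P".
pattern = re.compile(r"data_(.+?)_48P")

# Search each name once and keep only the names that matched.
matches = (pattern.search(name) for name in files)
substrings = [m.group(1) for m in matches if m]

print(substrings)  # ['ABC', 'DEF', 'GHI']
```

Names that do not match (here `other_96.txt` and `another_98.txt`) produce `None` and are filtered out by the `if m` test.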

Answer 2

re.search requires a string, not a list.

Use a loop instead:

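import re
m = []
for line in files:
    # re.search returns None for names that do not match, so check before calling group()
    match = re.search('data_(.+?)_48P', line)
    if match:
        m.append(match.group(1))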

Answer 3

re.search() doesn't accept a list as an argument; you need to use a loop and pass each element (which must be a string) to the function. You can use positive look-arounds to match only the string you want. Since re.search returns a match object (or None when nothing matches), you need group() to get the string:

>>> for i in files:
...     try:
...         print(re.search(r'(?<=data_).*(?=_48P)', i).group(0))
...     except AttributeError:
...         pass
...
ABC
DEF
GHI
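To make the difference between the two regex styles concrete, here is a small sketch (the variable names are illustrative) comparing the look-around pattern with the capture-group pattern used elsewhere in this thread, and showing what happens for a name that does not match:

```python
import re

name = "data_ABC_48P.txt"

# Look-arounds assert the surrounding text without consuming it,
# so the whole match, group(0), is already the wanted substring.
lookaround = re.search(r"(?<=data_).*(?=_48P)", name)
print(lookaround.group(0))  # ABC

# A capture group consumes the surrounding text too; the wanted part is group(1).
captured = re.search(r"data_(.+?)_48P", name)
print(captured.group(0))  # data_ABC_48P
print(captured.group(1))  # ABC

# A name that does not match gives None rather than a match object,
# which is why calling .group() on it raises AttributeError.
no_match = re.search(r"(?<=data_).*(?=_48P)", "other_96.txt")
print(no_match)  # None
```

Checking for `None` explicitly is usually a little clearer than catching `AttributeError`, but both approaches skip the non-matching names.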

Answer 4

from os import listdir
from os.path import isfile, join
import re
strings = []
for f in listdir(path):
    if isfile(join(path,f)):
        m = re.search('data_(.+?)_48P', f)
        if m:
            strings.append(m.group(1))

print(strings)

Output:

['ABC', 'DEF', 'GHI']
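The same idea can be wrapped in a reusable function. This is only a sketch under a few assumptions: it targets Python 3, the function name `extract_substrings` and its `pattern` parameter are made up for illustration, and a temporary directory stands in for the real one so the example runs anywhere:

```python
import os
import re
import tempfile


def extract_substrings(path, pattern=r"data_(.+?)_48P"):
    """Return the captured substring for every regular file in path that matches."""
    regex = re.compile(pattern)
    found = []
    # sorted() makes the order predictable; os.listdir() returns names in arbitrary order.
    for name in sorted(os.listdir(path)):
        if os.path.isfile(os.path.join(path, name)):
            m = regex.search(name)
            if m:
                found.append(m.group(1))
    return found


# Build a throwaway directory holding the file names from the question.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["data_ABC_48P.txt", "data_DEF_48P.txt", "data_GHI_48P.txt",
                 "other_96.txt", "another_98.txt"]:
        open(os.path.join(demo_dir, name), "w").close()
    result = extract_substrings(demo_dir)

print(result)  # ['ABC', 'DEF', 'GHI']
```

Sorting the directory listing is a design choice for reproducible output; drop it if the order does not matter.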

Answer 5

In R:

list.files('~/desktop/test')
# [1] "another_98.txt"   "data_ABC_48P.txt" "data_DEF_48P.txt" "data_GHI_48P.txt" "other_96.txt"

gsub('_', '', unlist(regmatches(l <- list.files('~/desktop/test'),
                                gregexpr('_(\\w+?)_', l, perl = TRUE))))
# [1] "ABC" "DEF" "GHI"

Another way:

l <- list.files('~/desktop/test', pattern = '_(\\w+?)_')

sapply(strsplit(l, '[_]'), '[[', 2)
# [1] "ABC" "DEF" "GHI"

5 answers

solution1
2 ACCEPTED 2014-12-05 17:17:36

solution2
0 2014-12-05 17:15:36

solution3
0 2014-12-05 17:19:20

solution4
0 2014-12-05 17:25:35

solution5
0 2014-12-05 17:49:18
