My question is very similar to the following: How to get a Substring from list of file names. I'm a newb to Python and would prefer a similar solution for Python (or R). I'd like to look into a directory, extract a particular substring from each applicable file name, and output the results as a vector (preferred), list, or array. For example, assume I have a directory with the following file names:
data_ABC_48P.txt
data_DEF_48P.txt
data_GHI_48P.txt
other_96.txt
another_98.txt
I would like to reference the directory and extract the following as a character vector (for use in R) or list:
"ABC", "DEF", "GHI"
I tried the following:
from os import listdir
from os.path import isfile, join
files = [ f for f in listdir(path) if isfile(join(path,f)) ]
import re
m = re.search('data_(.+?)_48P', files)
But I get the following error:
TypeError: expected string or buffer
files is of type list:
In [10]: type(files)
Out[10]: list
Even though I ultimately want this character vector as an input to R code, we are trying to transition all of our "scripting" to Python and use R solely for data analysis, so a Python solution would be great. I'm also using Ubuntu, so a cmd line or bash script solution could work as well. Thanks in advance!
Use a list comprehension, like this:
[re.search(r'data_(.+?)_48P', i).group(1) for i in files if re.search(r'data_.+?_48P', i)]
You need to iterate over the list contents in order to grab the substrings you want. re.search requires a string, not a list.
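The comprehension above runs the regex twice per file name (once to filter, once to extract). As a minimal sketch of the same idea with a single search per name, assuming a hard-coded `files` list that mirrors the directory in the question (the names `pattern`, `matches` and `substrings` are just illustrative):

```python
import re

# Hypothetical sample data mirroring the file names in the question.
files = [
    "data_ABC_48P.txt",
    "data_DEF_48P.txt",
    "data_GHI_48P.txt",
    "other_96.txt",
    "another_98.txt",
]

# Compile once; this is the same pattern used in the question.
pattern = re.compile(r"data_(.+?)_48P")

# Search each name once. search() returns None for names that do not match,
# so the "if m" filter drops other_96.txt and another_98.txt.
matches = (pattern.search(name) for name in files)
substrings = [m.group(1) for m in matches if m]

print(substrings)  # ['ABC', 'DEF', 'GHI']
```

The key point is unchanged: re.search is called on each string in turn, never on the list itself.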
Use a loop:
import re
m = []
for line in files:
    match = re.search('data_(.+?)_48P', line)
    if match:  # skip file names that do not match the pattern
        m.append(match.group(1))
re.search() doesn't accept a list as an argument; you need to use a loop and pass each element, which must be a string, to the function. You can use positive look-arounds to match only the part you want. Since re.search returns a match object (or None when nothing matches), you need group() to get the string:
>>> for i in files:
...     try:
...         print(re.search(r'(?<=data_).*(?=_48P)', i).group(0))
...     except AttributeError:
...         pass
...
ABC
DEF
GHI
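To make the group() part concrete, here is a small sketch of what re.search actually hands back for the two styles of pattern used in this thread. The variable names are only for illustration:

```python
import re

name = "data_ABC_48P.txt"

# Look-around version: the look-behind and look-ahead are not part of the
# match, so group(0) (the whole match) is already the wanted substring.
lookaround = re.search(r"(?<=data_).*(?=_48P)", name)
whole_match = lookaround.group(0)
print(whole_match)  # ABC

# Capturing-group version: group(0) is the whole match, group(1) is the
# text captured by the parentheses.
capturing = re.search(r"data_(.+?)_48P", name)
print(capturing.group(0))  # data_ABC_48P
print(capturing.group(1))  # ABC

# When nothing matches, re.search returns None. Calling .group() on None
# is what raises the AttributeError caught in the loop above.
no_match = re.search(r"data_(.+?)_48P", "other_96.txt")
print(no_match)  # None
```

Checking for None explicitly (`if m is not None:`) is an alternative to the try/except if you prefer not to rely on the exception.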
from os import listdir
from os.path import isfile, join
import re
strings = []
for f in listdir(path):
    if isfile(join(path, f)):
        m = re.search('data_(.+?)_48P', f)
        if m:
            strings.append(m.group(1))
print(strings)
Output:
['ABC', 'DEF', 'GHI']
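If you are on Python 3, the same loop can be wrapped in a reusable function with pathlib. This is only a sketch under a few assumptions: `extract_substrings` is a hypothetical helper name, the demo files are created in a temporary directory rather than your real one, and the result is sorted because directory listing order is not guaranteed.

```python
import re
import tempfile
from pathlib import Path


def extract_substrings(directory, pattern=r"data_(.+?)_48P"):
    """Return the first captured group from each matching file name, sorted."""
    regex = re.compile(pattern)
    found = []
    for entry in Path(directory).iterdir():
        if not entry.is_file():
            continue  # ignore subdirectories
        m = regex.search(entry.name)
        if m:
            found.append(m.group(1))
    return sorted(found)


# Demo: recreate the file names from the question in a throwaway directory.
sample_names = [
    "data_ABC_48P.txt",
    "data_DEF_48P.txt",
    "data_GHI_48P.txt",
    "other_96.txt",
    "another_98.txt",
]
with tempfile.TemporaryDirectory() as tmp:
    for name in sample_names:
        (Path(tmp) / name).touch()
    result = extract_substrings(tmp)

print(result)  # ['ABC', 'DEF', 'GHI']
```

In real use you would call `extract_substrings('/path/to/your/directory')` and skip the temporary-directory part.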
In R:
list.files('~/desktop/test')
# [1] "another_98.txt" "data_ABC_48P.txt" "data_DEF_48P.txt" "data_GHI_48P.txt" "other_96.txt"
gsub('_', '', unlist(regmatches(l <- list.files('~/desktop/test'),
gregexpr('_(\\w+?)_', l, perl = TRUE))))
# [1] "ABC" "DEF" "GHI"
Another way:
l <- list.files('~/desktop/test', pattern = '_(\\w+?)_')
sapply(strsplit(l, '[_]'), '[[', 2)
# [1] "ABC" "DEF" "GHI"