How to separate monthly and daily filenames in a Python list into two separate lists?
I have a list:
allFiles =['https://myurl.com/something//something_01-01-2020.csv',
'https://myurl.com/something//something_01-02-2020.csv',
'https://myurl.com/something//something_03-2020.csv'
'https://myurl.com/something//something_01-03-2020.csv',
'https://myurl.com/something//something_04-2020.csv'...]
How can I split the monthly and daily files into two separate lists?
Expected output:
daily = ['https://myurl.com/something//something_01-01-2020.csv',
'https://myurl.com/something//something_01-02-2020.csv',
'https://myurl.com/something//something_01-03-2020.csv']
monthly = ['https://myurl.com/something//something_03-2020.csv',
'https://myurl.com/something//something_04-2020.csv']
I am trying the following, but without success:
daily = [ x for x in allFiles if "%m-%Y.csv" not in x ]
Can anyone help? Thanks in advance!
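Here is a solution that uses regex to identify the daily and monthly date patterns:
import re
# the dot before "csv" is escaped so that it only matches a literal "."
daily_pattern = re.compile(r"\d{2}-\d{2}-\d{4}\.csv")
monthly_pattern = re.compile(r"\d{2}-\d{4}\.csv")
monthly, daily = [], []
for f in allFiles:
    # check the daily pattern first, because the monthly pattern also matches daily names
    if daily_pattern.search(f):
        daily.append(f)
    elif monthly_pattern.search(f):
        monthly.append(f)
    else:
        print('invalid pattern %s' % f)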
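The order of the two checks matters, because search() scans the whole string and the monthly pattern is also found inside a daily filename. The following is a small sketch of that effect; strict_monthly is a hypothetical, stricter pattern that is not part of the answer and assumes the date always follows an underscore.

```python
import re

daily_pattern = re.compile(r"\d{2}-\d{2}-\d{4}\.csv")
monthly_pattern = re.compile(r"\d{2}-\d{4}\.csv")

daily_name = "something_01-02-2020.csv"

# search() looks anywhere in the string, so the monthly pattern
# finds the tail "02-2020.csv" inside a daily filename.
monthly_hit = monthly_pattern.search(daily_name)
print(monthly_hit.group())

# Hypothetical stricter variant: require '_' right before the date
# and the end of the string right after ".csv".
strict_monthly = re.compile(r"_\d{2}-\d{4}\.csv$")
print(strict_monthly.search(daily_name))               # no match
print(strict_monthly.search("something_03-2020.csv"))  # match
```

With an anchored pattern like this, the two tests could be run in either order.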
You can split the URL to get only the part you want, then count the hyphens to see which format the date is in:
monthly = []
daily = []
for url in allFiles:
    # split the URL on '/' and keep only the part after the last '/'
    filename = url.rsplit('/', 1)[-1]
    # same idea, but split on '_' to keep only the date part, e.g. 01-01-2020.csv
    datestring = filename.rsplit('_', 1)[-1]
    datestring_hyphens = datestring.count('-')
    # append the full URL so the result matches the expected output
    if datestring_hyphens == 1:
        monthly.append(url)
    elif datestring_hyphens == 2:
        daily.append(url)
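Format codes such as %m-%Y are only interpreted by functions like datetime.strptime; the in operator compares them as literal text, which is why the attempt in the question does not filter anything. As a rough sketch of a stricter alternative to counting hyphens, the date part can be parsed instead. The helper name classify is hypothetical, and the day-first format "%d-%m-%Y" is an assumption, since the question does not say whether the daily names are day-first or month-first.

```python
from datetime import datetime

def classify(url):
    """Return 'daily', 'monthly', or None for an unrecognised name."""
    # Text after the last '_' with the file extension removed,
    # e.g. '01-01-2020' or '03-2020'.
    stem = url.rsplit('_', 1)[-1].rsplit('.', 1)[0]
    # Assumed formats: day-month-year for daily, month-year for monthly.
    for label, fmt in (("daily", "%d-%m-%Y"), ("monthly", "%m-%Y")):
        try:
            datetime.strptime(stem, fmt)
            return label
        except ValueError:
            continue
    return None

files = ['https://myurl.com/something//something_01-01-2020.csv',
         'https://myurl.com/something//something_03-2020.csv']
daily = [f for f in files if classify(f) == "daily"]
monthly = [f for f in files if classify(f) == "monthly"]
```

Unlike the hyphen count, this also rejects names whose date part is not a valid date, for example a month of 13.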
First, create a function that lets you classify the URLs into those that are days and those that are months:
allFiles = ['https://myurl.com/something//something_01-01-2020.csv',
            'https://myurl.com/something//something_01-02-2020.csv',
            'https://myurl.com/something//something_03-2020.csv',
            'https://myurl.com/something//something_01-03-2020.csv',
            'https://myurl.com/something//something_04-2020.csv']
def month_or_day(string):
    # number of date components: 2 for month-year, 3 for a full date
    return len(string.split('_')[1].split(".")[0].split('-'))
Then create a DataFrame to apply this function to each URL:
import pandas as pd
df = pd.DataFrame(allFiles, columns=['URL'])
# 2 means a monthly file, 3 means a daily file
df['intermediate'] = df['URL'].apply(month_or_day)
The URLs are then separated as follows:
print('Month : ',df[df['intermediate']==2]['URL'].tolist())
print('')
print('Day : ',df[df['intermediate']==3]['URL'].tolist())
Using a regular expression:
import re
daily_pattern = r"""
    ^        # Start of string
    .+?      # Match anything except newline (not greedy)
    \d{2}    # Two numerical values.
    -        # Hyphen
    \d{2}    # Two numerical values.
    -        # Hyphen
    \d{4}    # Four numerical values.
    \.\w+    # File extension with escaped period.
    $        # End of string
"""
# Compile with re.I (ignore case) and re.X (verbose pattern with comments)
p = re.compile(daily_pattern, flags=re.I | re.X)
daily = [f for f in allFiles if p.match(f)]
monthly = [f for f in allFiles if f not in daily]
Edit: updated to include more explanation.
Maybe something like this:
month_files = [f for f in allFiles if len(f.rpartition('_')[2].split('-'))==2]
day_files = [f for f in allFiles if len(f.rpartition('_')[2].split('-'))==3]
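rpartition splits the file name on the last _ and gives you 3 items, e.g. ['somename', '_', 'the date/month .csv']. You can then filter on the date part by splitting it and checking the length. With rpartition this works even if the file name contains more than one _.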
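To make the difference concrete, here is a short sketch comparing rpartition with a plain split on a name that contains several underscores. The URL is a made-up example, not one from the question.

```python
# Hypothetical URL with extra underscores in the path and file name.
name = "https://myurl.com/some_thing//some_thing_03-2020.csv"

# rpartition splits once, at the LAST underscore.
head, sep, tail = name.rpartition('_')
print(tail)             # the date part with the extension

parts = tail.split('-')
print(len(parts))       # 2 components, so this is a monthly file

# A plain split('_')[1] picks up a piece of the path instead of the date.
wrong = name.split('_')[1]
print(wrong)
```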
You can use a regular expression here.
Example:
import re
allFiles =['https://myurl.com/something//something_01-01-2020.csv',
'https://myurl.com/something//something_01-02-2020.csv',
'https://myurl.com/something//something_03-2020.csv',
'https://myurl.com/something//something_01-03-2020.csv',
'https://myurl.com/something//something_04-2020.csv']
daily = []
monthly = []
for i in allFiles:
    if re.search(r"_(\d+\-\d+\.csv)$", i):
        monthly.append(i)
    else:
        daily.append(i)
print(daily)
print(monthly)
Output:
['https://myurl.com/something//something_01-01-2020.csv', 'https://myurl.com/something//something_01-02-2020.csv', 'https://myurl.com/something//something_01-03-2020.csv']
['https://myurl.com/something//something_03-2020.csv', 'https://myurl.com/something//something_04-2020.csv']
Assuming there is no other "_" in the file names:
monthly = [file for file in allFiles if len(file.split('_')[1].split('-')) == 2]
daily = [file for file in allFiles if len(file.split('_')[1].split('-')) == 3]
Note that there is an error in your example: a comma is missing.
Your problem is a good fit for regular expressions.
import re
allFiles =['https://myurl.com/something//something_01-01-2020.csv',
'https://myurl.com/something//something_01-02-2020.csv',
'https://myurl.com/something//something_03-2020.csv',
'https://myurl.com/something//something_01-03-2020.csv',
'https://myurl.com/something//something_04-2020.csv']
dailyRegexp = re.compile(r".*\d\d-\d\d-\d\d\d\d\.csv$")
isDaily = lambda fn: dailyRegexp.match(fn)
daily = [fn for fn in allFiles if isDaily(fn)]
monthly = [fn for fn in allFiles if not isDaily(fn)]
print("Daily:", daily)
print("Monthly:", monthly)
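Explanation of the regular expression:
.* is any character ( . ), repeated any number of times ( * )
\d is any digit
- is just a literal -, with no special meaning
\. is a dot character (escaped with a backslash to prevent its special meaning)
csv is a literal string, with no special meaning
$ is the end of the string
Also note the r before the string. It marks a raw string, which prevents Python from interpreting \ as a special character.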
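The two points about the raw string and the escaped dot can be checked with a small sketch. The file name odd_name is a made-up example used only to show what an unescaped dot would let through.

```python
import re

# In a normal string literal "\b" is a single backspace character;
# in a raw string it stays as backslash + b (a regex word boundary).
normal_len = len("\b")
raw_len = len(r"\b")
print(normal_len, raw_len)

escaped = re.compile(r".*\d\d-\d\d-\d\d\d\d\.csv$")
unescaped = re.compile(r".*\d\d-\d\d-\d\d\d\d.csv$")

# Hypothetical name with an 'X' where the dot should be.
odd_name = "something_01-01-2020Xcsv"
print(bool(escaped.match(odd_name)))    # the escaped dot rejects it
print(bool(unescaped.match(odd_name)))  # the bare dot matches any character
```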