簡體   English   中英

熊貓用正則表達式讀取csv

[英]pandas read csv with regex

我有一個文件夾trip_data ,其中包含許多帶日期的csv文件,如下所示:

trip_data/
├── df_trip_20140803_1.csv
├── df_trip_20140803_2.csv
├── df_trip_20140803_3.csv
├── df_trip_20140803_4.csv
├── df_trip_20140803_5.csv
├── df_trip_20140803_6.csv
├── df_trip_20140804_1.csv
├── df_trip_20140804_2.csv
├── df_trip_20140804_3.csv
├── df_trip_20140804_4.csv
├── df_trip_20140804_5.csv
├── df_trip_20140804_6.csv
├── df_trip_20140805_1.csv
├── df_trip_20140805_2.csv
├── df_trip_20140805_3.csv
├── df_trip_20140805_4.csv
├── df_trip_20140805_5.csv
├── df_trip_20140805_6.csv
├── df_trip_20140806_1.csv
├── df_trip_20140806_2.csv
├── df_trip_20140806_3.csv
├── df_trip_20140806_4.csv

現在我想按日期分別使用python pandas加載所有這些文件,意味着4個DataFrame df_traip_20140803, df_traip_20140804, df_traip_20140805, df_traip_20140806

我的代碼如下所示:

days = [20140803,20140804,20140805,20140806]

for day in days:
    ## Locate to the path
    path ='./trip_data/df_trip_%d*.csv' % day
    df = pd.read_csv(path, header=None, nrows=10,
                        names=['ID','lat','lon','status','timestamp']) 

無法獲得正確的結果。 我怎樣才能做到這一點?

我將所有這些CSV收集到具有以下結構的DataFrames字典中:

df['20140803'] -DF,包含屬於所有df_trip_20140803_*.csv CSV文件的串聯數據。

解:

import os
import re
import glob
import pandas as pd

fpattern = r'D:\temp\.data\41444939\df_trip_{}_{}.csv'
files = glob.glob(fpattern.format('*','*'))

dates = sorted(set([re.split(r'_(\d{8})_(\d+)\.(\w+)', f)[1] for f in files]))

dfs = {}
for d in dates:
    dfs[d] = pd.concat((pd.read_csv(f) for f in glob.glob(fpattern.format(d, '*'))), ignore_index=True)

測試:

In [95]: dfs.keys()
Out[95]: dict_keys(['20140804', '20140805', '20140803', '20140806'])

In [96]: dfs['20140803']
Out[96]:
    a  b  c
0   0  0  7
1   3  7  1
2   9  7  3
3   7  4  7
4   5  2  4
5   0  0  4
6   7  2  2
7   8  4  1
8   0  8  3
9   3  9  0
10  7  3  9
11  1  9  8
12  6  7  2
13  3  8  1
14  3  4  5
15  0  9  2
16  5  8  7
17  8  5  4
18  2  0  2
19  9  6  6
20  6  6  6
21  2  6  9
22  1  0  8
23  3  1  1
24  7  4  2
25  7  4  2
26  8  3  7
27  7  3  2
28  1  7  7
29  3  6  5

設定:

fn = r'D:\temp\.data\41444939\a.txt'
base_dir = r'D:\temp\.data\41444939'
files = open(fn).read().splitlines()
for f in files:
    pd.DataFrame(np.random.randint(0, 10, (5, 3)), columns=list('abc')) \
      .to_csv(os.path.join(base_dir, f), index=False)

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM