
Python pandas: how to read a CSV file block by block

I'm trying to read a CSV file, block by block.

CSV looks like:

No.,time,00:00:00,00:00:01,00:00:02,00:00:03,00:00:04,00:00:05,00:00:06,00:00:07,00:00:08,00:00:09,00:00:0A,...
1,2021/09/12 02:16,235,610,345,997,446,130,129,94,555,274,4,
2,2021/09/12 02:17,364,210,371,341,294,87,179,106,425,262,3,
1434,2021/09/12 02:28,269,135,372,262,307,73,86,93,512,283,4,
1435,2021/09/12 02:29,281,207,688,322,233,75,69,85,663,276,2,
No.,time,00:00:10,00:00:11,00:00:12,00:00:13,00:00:14,00:00:15,00:00:16,00:00:17,00:00:18,00:00:19,00:00:1A,...
1,2021/09/12 02:16,255,619,200,100,453,456,4,19,56,23,4,
2,2021/09/12 02:17,368,21,37,31,24,8,19,1006,4205,2062,30,
1434,2021/09/12 02:28,2689,1835,3782,2682,307,743,256,741,52,23,6,
1435,2021/09/12 02:29,2281,2047,6848,3522,2353,755,659,885,6863,26,36,

Each block starts with a "No." header row, and data rows follow.

import pickle
import struct
import time
import zipfile

import pandas as pd


def run(sock, delay, zipobj):
    zf = zipfile.ZipFile(zipobj)
    for f in zf.namelist():
        print(zf.filename)
        print("csv name: ", f)
        df = pd.read_csv(zf.open(f), skiprows=[0, 1, 2, 3, 4, 5])  # nrows=1435? (but what about the next blocks?)
        print(df, '\n')
        date_pattern = '%Y/%m/%d %H:%M'
        # create epoch as a column
        df['epoch'] = df.apply(lambda row: int(time.mktime(time.strptime(row.time, date_pattern))), axis=1)
        tuples = []  # data will be saved in a list
        formated_str = 'perf.type.serial.object.00.00.00.TOTAL_IOPS'
        for each_column in list(df.columns)[2:-1]:
            for e in zip(list(df['epoch']), list(df[each_column])):
                each_column = each_column.replace("X", '')
                # print(f"perf.type.serial.LDEV.{each_column}.TOTAL_IOPS", e)
                tuples.append((f"perf.type.serial.LDEV.{each_column}.TOTAL_IOPS", e))
        package = pickle.dumps(tuples, 1)
        size = struct.pack('!L', len(package))
        sock.sendall(size)
        sock.sendall(package)
        time.sleep(delay)

Many thanks for your help,

Load your file with pd.read_csv and start a new block each time the first column contains "No.". Then use groupby to iterate over the blocks and build a new dataframe for each one.

data = pd.read_csv('data.csv', header=None)
dfs = []
# each time column 0 equals 'No.', cumsum() starts a new group, i.e. a new block
for _, df in data.groupby(data[0].eq('No.').cumsum()):
    # the first row of each group is that block's header row, the rest are its data rows
    df = pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0])
    dfs.append(df.rename_axis(columns=None))

Output:

# First block
>>> dfs[0]
    No.              time 00:00:00 00:00:01 00:00:02 00:00:03 00:00:04 00:00:05 00:00:06 00:00:07 00:00:08 00:00:09 00:00:0A  ...
0     1  2021/09/12 02:16      235      610      345      997      446      130      129       94      555      274        4  NaN
1     2  2021/09/12 02:17      364      210      371      341      294       87      179      106      425      262        3  NaN
2  1434  2021/09/12 02:28      269      135      372      262      307       73       86       93      512      283        4  NaN
3  1435  2021/09/12 02:29      281      207      688      322      233       75       69       85      663      276        2  NaN


# Second block
>>> dfs[1]
    No.              time 00:00:10 00:00:11 00:00:12 00:00:13 00:00:14 00:00:15 00:00:16 00:00:17 00:00:18 00:00:19 00:00:1A  ...
0     1  2021/09/12 02:16      255      619      200      100      453      456        4       19       56       23        4  NaN
1     2  2021/09/12 02:17      368       21       37       31       24        8       19     1006     4205     2062       30  NaN
2  1434  2021/09/12 02:28     2689     1835     3782     2682      307      743      256      741       52       23        6  NaN
3  1435  2021/09/12 02:29     2281     2047     6848     3522     2353      755      659      885     6863       26       36  NaN

and so on.
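If you then want to plug these blocks into the rest of your function, here is a minimal sketch of the per-block transformation. It reuses the date_pattern and metric naming from your code, and assumes the column layout shown above (No., time, the time-of-day columns, a trailing empty column); adjust the slice for your real files.

import time

date_pattern = '%Y/%m/%d %H:%M'
tuples = []
for df in dfs:
    # values are strings here because each block was rebuilt from raw rows
    df['epoch'] = df['time'].apply(
        lambda t: int(time.mktime(time.strptime(t, date_pattern))))
    # skip 'No.', 'time' and the trailing empty column; 'epoch' is the last column and is skipped too
    for each_column in list(df.columns)[2:-2]:
        name = each_column.replace("X", '')
        for e in zip(df['epoch'], df[each_column]):
            tuples.append((f"perf.type.serial.LDEV.{name}.TOTAL_IOPS", e))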

Sorry, I can't find a correct way to adapt this to my code:

def run(sock, delay, zipobj):
    zf = zipfile.ZipFile(zipobj)
    for f in zf.namelist():
        print("using zip :", zf.filename)
        myobject = re.search(r'(^[a-zA-Z]{4})_.*', f)
        Objects = myobject.group(1)
        if Objects == 'LDEV':
            metric = re.search('.*LDEV_(.*)/.*', f)
            metric = metric.group(1)
        elif Objects == 'Port':
            metric = re.search('.*/(Port_.*).csv', f)
            metric = metric.group(1)
        else:
            print("None")
        print("using csv : ", f)
        # header=None so the first column can be addressed as data[0]
        data = pd.read_csv(zf.open(f), skiprows=[0, 1, 2, 3, 4, 5], header=None)
        dfs = []
        tuples = []  # data from every block will be collected in this list
        for _, df in data.groupby(data[0].eq('No.').cumsum()):
            df = pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0])
            dfs.append(df.rename_axis(columns=None))
            print("here")
            date_pattern = '%Y/%m/%d %H:%M'
            # create epoch as a column on the block dataframe (df, not the dfs list)
            df['epoch'] = df.apply(lambda row: int(time.mktime(time.strptime(row.time, date_pattern))), axis=1)
            # formated_str = 'perf.type.serial.object.00.00.00.TOTAL_IOPS'
            for each_column in list(df.columns)[2:-1]:
                for e in zip(list(df['epoch']), list(df[each_column])):
                    each_column = each_column.replace("X", '')
                    tuples.append((f"perf.type.serial.{Objects}.{each_column}.{metric}", e))
        package = pickle.dumps(tuples, 1)
        size = struct.pack('!L', len(package))
        sock.sendall(size)
        sock.sendall(package)
        time.sleep(delay)

Thanks for your help,

I recommend processing the "blocks" of data in one pass, without Pandas.

Once the data is separated from the block "headers" you can run the normalized CSV through Pandas and do all the column/cell processing/transformations you need.

I'm starting with your sample data:

input.csv

No.,time,00:00:00,00:00:01,00:00:02,00:00:03,00:00:04,00:00:05,00:00:06,00:00:07,00:00:08,00:00:09,00:00:0A,...
1,2021/09/12 02:16,235,610,345,997,446,130,129,94,555,274,4,
2,2021/09/12 02:17,364,210,371,341,294,87,179,106,425,262,3,
1434,2021/09/12 02:28,269,135,372,262,307,73,86,93,512,283,4,
1435,2021/09/12 02:29,281,207,688,322,233,75,69,85,663,276,2,
No.,time,00:00:10,00:00:11,00:00:12,00:00:13,00:00:14,00:00:15,00:00:16,00:00:17,00:00:18,00:00:19,00:00:1A,...
1,2021/09/12 02:16,255,619,200,100,453,456,4,19,56,23,4,
2,2021/09/12 02:17,368,21,37,31,24,8,19,1006,4205,2062,30,
1434,2021/09/12 02:28,2689,1835,3782,2682,307,743,256,741,52,23,6,
1435,2021/09/12 02:29,2281,2047,6848,3522,2353,755,659,885,6863,26,36,

I run this "phase one" script (which captures which block the data row came from, so you can check to make sure it's working correctly):

import csv

block = 0
header = None
rows = []
with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # manually read first row to get "header"
    block += 1             # this was the first block

    for row in reader:
        # Single out rows which are your "block" delimiters
        if row[0] == 'No.':
            block += 1
            continue  # to next row, which should be "data"

        # Append only "data" rows, including block no. for visual check
        rows.append([block] + row)


with open('no_blocks.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Block'] + header)
    writer.writerows(rows)

I get this for no_blocks.csv:

Block,No.,time,00:00:00,00:00:01,00:00:02,00:00:03,00:00:04,00:00:05,00:00:06,00:00:07,00:00:08,00:00:09,00:00:0A,...
1,1,2021/09/12 02:16,235,610,345,997,446,130,129,94,555,274,4,
1,2,2021/09/12 02:17,364,210,371,341,294,87,179,106,425,262,3,
1,1434,2021/09/12 02:28,269,135,372,262,307,73,86,93,512,283,4,
1,1435,2021/09/12 02:29,281,207,688,322,233,75,69,85,663,276,2,
2,1,2021/09/12 02:16,255,619,200,100,453,456,4,19,56,23,4,
2,2,2021/09/12 02:17,368,21,37,31,24,8,19,1006,4205,2062,30,
2,1434,2021/09/12 02:28,2689,1835,3782,2682,307,743,256,741,52,23,6,
2,1435,2021/09/12 02:29,2281,2047,6848,3522,2353,755,659,885,6863,26,36,

Now you can run another script, with Pandas, on no_blocks.csv to transform your data without having to include the logic for skipping rows.
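For example, here is a minimal sketch of that second script, assuming the no_blocks.csv produced above; the epoch conversion reuses the date_pattern from the question, and the per-block loop is only a placeholder for your own processing:

import time

import pandas as pd

date_pattern = '%Y/%m/%d %H:%M'

df = pd.read_csv('no_blocks.csv')
# convert the timestamp to epoch seconds once for the whole normalized file
df['epoch'] = df['time'].apply(lambda t: int(time.mktime(time.strptime(t, date_pattern))))

# the Block column written by the phase-one script lets you process one block at a time
for block_no, block_df in df.groupby('Block'):
    print(f"block {block_no}: {len(block_df)} rows")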
