Loading CSV files with multiple column layouts into multiple dataframes

Question

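I'm trying to load some large CSV files that appear to contain several different column layouts, and I'm struggling with it.

I don't know who designed these CSV files, but each one appears to contain both event data and log data. There are also some initial status lines at the start of each file.

Everything is on separate rows. The event data uses 2 columns (date and event comment); the log data has multiple columns (date plus 20+ columns).

An example of the data layout is given below: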

Initial; [Status] The Zoo is Closed;
Initial; [Status] The Sun is Down;
Initial; [Status] Monkeys are sleeping;

Time;No._Of_Monkeys;Monkeys_inside;Monkeys_Outside;Number_of_Bananas
06:00; 5;5;0;10
07:00; 5;5;0;10
07:10;[Event] Sun is up
08:00; 5;5;0;10
08:30; [Event] Monkey Doors open and Zoo Opens
09:00; 5;5;0;10
08:30; [Event] Monkey Goes out
09:00; 5;4;1;10
08:30; [Event] Monkey Eats Banana
09:00; 5;4;1;9
08:30; [Event] Monkey Goes out
09:00; 5;5;2;9

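What I want to do is put the log data into one dataframe and the initial and event data into another.

I can read the CSV files line by line with csv_reader, but this proves very slow, especially when going through multiple files that each contain about 40k lines.

Below is the code I am using: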

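import csv
import os

import pandas as pd

csv_files = [f for f in os.listdir('.') if f.endswith('.log')]

EventLog = pd.DataFrame()
DataLog = None  # created when the header row is seen

for file in csv_files:
  # Open the CSV file in read mode
  with open(file, 'r') as csv_file:
    # Use the csv module to parse the file
    csv_reader = csv.reader(csv_file, delimiter=';')

    # Loop through the rows of the file
    for row in csv_reader:
      # Drop empty fields left by a trailing ';' on the status lines
      row = [field for field in row if field != '']
      # Two fields: initial status or event data
      if len(row) == 2:
        # Add the row to the event log (pd.concat stands in for
        # DataFrame.append, which pandas 2.0 removed; still one copy per row)
        EventLog = pd.concat([EventLog, pd.DataFrame([row])], ignore_index=True)
      # More than two fields: log data
      elif len(row) > 2:
        # The first such row supplies the column headers
        if DataLog is None:
          DataLog = pd.DataFrame(columns=row)
        # Skip the repeated header row of later files
        elif row != list(DataLog.columns):
          # Add the row to the data log
          DataLog = pd.concat([DataLog, pd.DataFrame([row], columns=DataLog.columns)], ignore_index=True)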

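Is there a better way to do this, preferably a faster one?

If I use pandas read_csv, it seems to load only the initial data, i.e. the first 3 lines of my data above. I can use skiprows to skip to where the data starts, and then it loads the rest, but I can't see how to separate the event data from the log data from there.

So I'm looking for ideas before I lose what's left of my hair.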

Answer 1

If I understand your data format correctly, I would do something like this:

import pandas as pd

# Read every line as a single string column, with no header and no index
# (sep=',' relies on the file containing no commas)
df = pd.read_csv("your_file_name.log", header=None, sep=',')
# Split each line on ';' so that every row holds a list of values
tmp_df = df[0].str.split(";")

# Drop the empty values produced by the trailing ';' on the first 3 rows
tmp_df = tmp_df.map(lambda x: [y for y in x if y != ''])
# Rows with exactly 2 values go into one dataframe
EventLog = pd.DataFrame(tmp_df[tmp_df.str.len() == 2].to_list())
# All other rows go into another dataframe; the first of them holds the column names
data_log_tmp = tmp_df[tmp_df.str.len() != 2].to_list()
DataLog = pd.DataFrame(data_log_tmp[1:], columns=data_log_tmp[0])
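
As a rough sketch of how this approach could be packaged for reuse, the steps can be wrapped in a function and tried on a shortened copy of the sample from the question. The function name split_log, the output column names "Time" and "Event", and the extra whitespace stripping are illustrative assumptions, not part of the original answer. Like the original, it assumes the file contains no commas.

```python
import io

import pandas as pd


def split_log(source):
    """Return (events, data) for one log, given a path or file-like object."""
    # One string per line; this relies on the file containing no commas.
    lines = pd.read_csv(source, header=None, sep=",")[0]
    # Split on ';' and drop empty or whitespace-only fields.
    fields = lines.str.split(";").map(
        lambda parts: [p.strip() for p in parts if p.strip()]
    )
    counts = fields.str.len()
    # Two fields: status and event rows. Column names here are illustrative.
    events = pd.DataFrame(fields[counts == 2].to_list(), columns=["Time", "Event"])
    # Everything else: the header row followed by the log rows.
    rows = fields[counts != 2].to_list()
    data = pd.DataFrame(rows[1:], columns=rows[0])
    return events, data


# Shortened copy of the sample data from the question.
sample = """Initial; [Status] The Zoo is Closed;
Initial; [Status] The Sun is Down;

Time;No._Of_Monkeys;Monkeys_inside;Monkeys_Outside;Number_of_Bananas
06:00; 5;5;0;10
07:10;[Event] Sun is up
08:00; 5;5;0;10
"""

events, data = split_log(io.StringIO(sample))
print(events)
print(data)
```

All values come back as strings, so numeric columns in the log frame would still need converting, for example with astype.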

Answer 2

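Here is an example of loading the CSV file, assuming the Monkeys_inside field is always NaN in the event data and always populated in the log data, since I use it as the condition for retrieving the event data: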

import pandas as pd

# skiprows=3 skips the status lines; event rows get NaN in the columns they lack
df = pd.read_csv('huge_data.csv', skiprows=3, sep=';')
log_df = df.dropna().reset_index(drop=True)
event_df = df[df['Monkeys_inside'].isna()].reset_index(drop=True)

This also assumes that every one of your CSV files contains those 3 status lines.
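
Because skiprows=3 discards the status lines, they would be missing from the event dataframe that the question asks for. One possible way to recover them, sketched here on a shortened copy of the sample data, is to read the first three lines separately and concatenate them with the event rows. The column names "Time" and "Event" and the variable names are assumptions made for illustration.

```python
import io
from itertools import islice

import pandas as pd

# Shortened copy of the sample data from the question.
sample = """Initial; [Status] The Zoo is Closed;
Initial; [Status] The Sun is Down;
Initial; [Status] Monkeys are sleeping;

Time;No._Of_Monkeys;Monkeys_inside;Monkeys_Outside;Number_of_Bananas
06:00; 5;5;0;10
07:10;[Event] Sun is up
08:00; 5;5;0;10
08:30; [Event] Monkey Doors open and Zoo Opens
"""

# Main table: skip the three status lines; the blank line is ignored by default.
df = pd.read_csv(io.StringIO(sample), skiprows=3, sep=";")
log_df = df.dropna().reset_index(drop=True)

# Event rows have no value in Monkeys_inside; keep only their two real fields.
event_df = df.loc[df["Monkeys_inside"].isna(), ["Time", "No._Of_Monkeys"]]
event_df = event_df.rename(columns={"No._Of_Monkeys": "Event"})

# Status lines: take the first three lines and drop the trailing ';'.
with io.StringIO(sample) as handle:
    initial_rows = [
        line.strip().rstrip(";").split(";", 1) for line in islice(handle, 3)
    ]
initial_df = pd.DataFrame(initial_rows, columns=["Time", "Event"])

# Status lines first, then the events, in one frame.
events = pd.concat([initial_df, event_df], ignore_index=True)
print(log_df)
print(events)
```

Note that the second column of log_df is read as text, because in the raw file that column also holds the event messages; it can be converted after the split, for example with .str.strip().astype(int).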

Keep in mind that if your CSV files contain duplicated rows, the dataframe will keep them. To remove them, just call the drop_duplicates function:

event_df = event_df.drop_duplicates()
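
Separately from the two answers, the row-by-row loop in the question is slow largely because every append builds a new dataframe. If a csv-module loop is still preferred, one option is to collect rows in plain Python lists across all files and build each dataframe once at the end. The sketch below is an alternative under that assumption, not either answerer's method; parse_log and the two in-memory "files" are hypothetical stand-ins for the .log files on disk.

```python
import csv
import io

import pandas as pd


def parse_log(handle):
    """Bucket rows by field count in one pass: (event_rows, header, data_rows)."""
    event_rows, data_rows, header = [], [], None
    for row in csv.reader(handle, delimiter=";"):
        # Drop empty fields (trailing ';') and surrounding whitespace.
        row = [field.strip() for field in row if field.strip()]
        if len(row) == 2:
            event_rows.append(row)
        elif len(row) > 2:
            if header is None:
                header = row  # first wide row is the column header
            else:
                data_rows.append(row)
    return event_rows, header, data_rows


head = "Time;No._Of_Monkeys;Monkeys_inside;Monkeys_Outside;Number_of_Bananas\n"
file_a = "Initial; [Status] The Zoo is Closed;\n\n" + head + "06:00; 5;5;0;10\n07:10;[Event] Sun is up\n"
file_b = "Initial; [Status] The Sun is Down;\n\n" + head + "06:00; 5;4;1;9\n08:30; [Event] Monkey Goes out\n"

all_events, all_data, columns = [], [], None
for text in (file_a, file_b):
    events, header, data = parse_log(io.StringIO(text))
    all_events.extend(events)
    all_data.extend(data)
    columns = columns or header

# Build each dataframe once, after all files have been read.
event_log = pd.DataFrame(all_events, columns=["Time", "Event"])
data_log = pd.DataFrame(all_data, columns=columns)
print(event_log)
print(data_log)
```

With real files, io.StringIO(text) would be replaced by an open file handle for each path.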


Answer 1
1, accepted, 2023-01-09 22:29:24

Answer 2
0, 2023-01-09 23:25:30
