
Load CSV files with multiple columns into several dataframes

I am trying to load some large CSV files which appear to have multiple columns, and I am struggling with it.

I don't know who designed these CSV files, but each one appears to contain both event data and log data. At the start of each file there are also some initial status lines.

Everything is in separate rows. The Event data uses 2 columns (Date and Event comment). The Log data has multiple columns (Date plus 20+ columns).

I give an example of the type of data setup below:

Initial; [Status] The Zoo is Closed;
Initial; Status] The Sun is Down;
Initial; [Status] Monkeys ar sleeping;

Time;No._Of_Monkeys;Monkeys_inside;Monkeys_Outside;Number_of_Bananas
06:00; 5;5;0;10
07:00; 5;5;0;10
07:10;[Event] Sun is up
08:00; 5;5;0;10
08:30; [Event] Monkey Doors open and Zoo Opens
09:00; 5;5;0;10
08:30; [Event] Monkey Goes out
09:00; 5;4;1;10
08:30; [Event] Monkey Eats Banana
09:00; 5;4;1;9
08:30; [Event] Monkey Goes out
09:00; 5;5;2;9

Now what I want to do is put the Log data into one dataframe and the Initial and Event data into another.

I can read the CSV files with csv.reader and go row by row, but this is proving very slow, especially when going through multiple files, each containing about 40k rows.

Below is the code I am using:

import os
import csv
import pandas as pd

# Collect the log files in the current directory
csv_files = [f for f in os.listdir('.') if f.endswith('.log')]

# DataFrames to collect the two kinds of rows
EventLog = pd.DataFrame()
DataLog = pd.DataFrame()

for file in csv_files:
  # Open the CSV file in read mode
  with open(file, 'r') as csv_file:
    # Use the csv module to parse the file
    csv_reader = csv.reader(csv_file, delimiter=';')

    # Loop through the rows of the file
    for row in csv_reader:
      # If the row has event data (two fields)
      if len(row) == 2:
        # Add the row to the event log
        EventLog = EventLog.append(pd.Series(row), ignore_index=True)
      # If the row has log data (more than two fields)
      elif len(row) > 2:
        # The first such row becomes the column headers
        if DataLog.empty:
          DataLog = pd.DataFrame(columns=row)
        else:
          # Add the row to the DataLog DataFrame
          DataLog = DataLog.append(pd.Series(row), ignore_index=True)

Is there a better way to do this... preferably faster?

If I use pandas read_csv it seems to only load the Initial data, i.e. the first 3 lines of my data above. I can use skiprows to skip down to where the log data starts, and then it will load the rest, but I can't figure out how to separate the event and log data from there.

So I'm looking for ideas before I lose what little hair I have left.

If I understood your data format correctly, I would do something like this:

import pandas as pd

# read the data as a single column, without headers or indexes
df = pd.read_csv("your_file_name.log", header=None, sep=',')
# split the values in this column by ";" (each row becomes a list of values)
tmp_df = df[0].str.split(";")

# drop the empty values in the first 3 rows (they end with ";")
tmp_df = tmp_df.map(lambda x: [y for y in x if y != ''])
# rows with 2 values go into one dataframe
EventLog = pd.DataFrame(tmp_df[tmp_df.str.len() == 2].to_list())
# the remaining rows go into another dataframe (the first of them holds the column names)
data_log_tmp = tmp_df[tmp_df.str.len() != 2].to_list()
DataLog = pd.DataFrame(data_log_tmp[1:], columns=data_log_tmp[0])
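
Since everything comes in as strings, DataLog will need a bit of tidying before you can work with the numbers. A minimal follow-up sketch, assuming the column names from the sample data above (Time, No._Of_Monkeys, and so on) and that every column except Time is numeric:

# strip stray whitespace from the header names and the values
DataLog.columns = DataLog.columns.str.strip()
DataLog = DataLog.apply(lambda col: col.str.strip())

# convert every column except Time to numbers (errors='coerce' turns bad values into NaN)
numeric_cols = [c for c in DataLog.columns if c != 'Time']
DataLog[numeric_cols] = DataLog[numeric_cols].apply(pd.to_numeric, errors='coerce')

# give the event dataframe readable column names (hypothetical names)
EventLog.columns = ['Time', 'Comment']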

Here is an example of loading the CSV file, assuming that the Monkeys_inside field is always NaN in the event data and filled in in the log data, because I used it as the condition to pick out the event data:

df = pd.read_csv('huge_data.csv', skiprows=3, sep=';')
# log rows have every column filled in
log_df = df.dropna().reset_index(drop=True)
# event rows have NaN in the Monkeys_inside column
event_df = df[df['Monkeys_inside'].isnull()].reset_index(drop=True)

This also assumes that all your CSV files contain those 3 Status lines.
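
Because skiprows=3 throws those Status lines away, you can read them back separately if you want the Initial entries as well. A small sketch, assuming the first 3 lines of the file are always the Initial status rows:

# read just the first 3 status lines (the trailing ';' creates an empty third field)
initial_df = pd.read_csv('huge_data.csv', nrows=3, header=None, sep=';')
initial_df = initial_df.iloc[:, :2]
initial_df.columns = ['Time', 'Comment']   # hypothetical names

You could then concatenate these rows with event_df, after renaming the relevant columns, if you want the Initial and Event data in a single dataframe.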

Keep in mind that the dataframes will hold duplicated rows if there are some in your CSV files; to remove them, just call the drop_duplicates function and you're good:

event_df = event_df.drop_duplicates()
