Pandas: formatting DataFrame columns and an error when adding a timedelta index

Question

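I am trying to use pandas to do some analysis on some messaging data, and I am running into a few problems while preparing the data. It comes from a database I do not control, so I need to do some pruning and formatting before analyzing it.

Here is where I am so far: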

#select all the messages in the database. Be careful if you get the whole test data base, may have 5000000 messages.
full_set_data = pd.read_sql("Select * from message",con=engine)

After changing the timestamp and setting it as the index, I can no longer use to_csv.

#convert timestamp to a timedelta and set as index
#full_set_data[['timestamp']] = full_set_data[['timestamp']].astype(np.timedelta64)
indexed = full_set_data.set_index('timestamp')
indexed.to_csv('indexed.csv')

#extract the data columns I really care about since there are a bunch I don't need
datacolumns = indexed[['address','subaddress','rx_or_tx', 'wordcount'] + [col for col in indexed.columns if ('DATA' in col)]]

Here I need to format the DATA columns, and I get a "SettingWithCopyWarning".

#now need to format the DATA columns to something useful by removing the upper 4 bytes
for col in datacolumns.columns:
  if 'DATA' in col:
    datacolumns[col] = datacolumns[col].apply(lambda x : int(x,16) & 0x0000ffff)

datacolumns.to_csv('data_col.csv')


#now group the data by "interaction key"
groups = datacolumns.groupby(['address','subaddress','rx_or_tx'])

I need to figure out how to get all the messages from a given group. get_group() requires that I know the key values ahead of time.

key_group = groups.get_group((1,1,1))
#foreach group in groups:
  #do analysis

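I have tried my best to solve the problems I am running into, but I cannot seem to get past them. I am sure it comes from my misunderstanding or misuse of pandas, as I am still figuring it out.

I am looking to solve these issues:

1) I cannot save to CSV after adding the timestamp index as a timedelta64.

2) How do I apply a function to a set of columns, when reformatting the DATA columns, without triggering the SettingWithCopyWarning?

3) How do I grab the rows of each group without using get_group(), since I do not know the keys ahead of time?

Thanks for any insight and help so I can better understand how to use pandas properly.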

Answer 1

First, you can set the index column and parse the dates when querying the database:

indexed = pd.read_sql_query("Select * from message", con=engine,
                            parse_dates=['timestamp'], index_col='timestamp')

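Note that I used pd.read_sql_query here rather than pd.read_sql. pd.read_sql would work as well; it is a convenience wrapper that delegates to read_sql_query when given a query string.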
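To make the call above concrete, here is a minimal sketch that runs against an in-memory SQLite database. The table contents, the `DATA0` column name, and the text timestamp format are assumptions made for illustration; the real database in the question may store timestamps differently.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the real database: an in-memory SQLite table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE message (timestamp TEXT, address INTEGER, DATA0 TEXT)")
con.executemany(
    "INSERT INTO message VALUES (?, ?, ?)",
    [
        ("2015-06-09 20:00:00", 1, "0x12345678"),
        ("2015-06-09 20:00:01", 2, "0xdeadbeef"),
    ],
)

# Parse the timestamp column and make it the index while reading.
indexed = pd.read_sql_query(
    "SELECT * FROM message",
    con,
    parse_dates=["timestamp"],
    index_col="timestamp",
)

# With no path argument, to_csv returns the CSV text as a string.
csv_text = indexed.to_csv()
con.close()
```

With the dates parsed on read, the index is a DatetimeIndex and no separate astype conversion is needed before writing the CSV.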

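The SettingWithCopyWarning arises because datacolumns may be a view of indexed, i.e. a subset of its rows/columns, rather than an object in its own right. See this part of the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

One way around this is to define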

datacolumns = indexed[<cols>].copy()
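As a rough sketch of the .copy() approach applied to the DATA formatting step from the question, assuming the DATA values are hexadecimal strings (the sample frame and the `DATA0` and `unused` column names are made up for illustration):

```python
import pandas as pd

# Hypothetical sample standing in for `indexed`.
indexed = pd.DataFrame(
    {
        "address": [1, 2],
        "DATA0": ["0x12345678", "0xdeadbeef"],
        "unused": ["a", "b"],
    }
)

data_cols = [col for col in indexed.columns if "DATA" in col]

# .copy() makes datacolumns an independent frame, so assigning to its
# columns is unambiguous and does not raise SettingWithCopyWarning.
datacolumns = indexed[["address"] + data_cols].copy()

# Keep only the lower 16 bits of each hex string.
for col in data_cols:
    datacolumns[col] = datacolumns[col].apply(lambda x: int(x, 16) & 0xFFFF)
```

The original `indexed` frame is left untouched, which is the trade-off of this option: it costs an extra copy of the selected columns.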

Another would be to do

indexed = indexed[<cols>]

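which effectively removes the unwanted columns, if you are happy that you will not need them again. You can then manipulate indexed at your leisure. As for the groupby, you could introduce a column of tuples to serve as the group key: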

indexed['interaction_key'] = list(zip(indexed['address'], indexed['subaddress'], indexed['rx_or_tx']))
indexed.groupby('interaction_key').apply(
           lambda df: some_function(df.interaction_key, ...))
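If the goal is simply to visit every group without knowing the keys beforehand, grouping on the three columns directly and iterating over the GroupBy object may be enough, since iteration yields each key together with its rows. A minimal sketch, with made-up sample data and a wordcount sum standing in for the real analysis:

```python
import pandas as pd

# Hypothetical sample with the key columns from the question.
df = pd.DataFrame(
    {
        "address": [1, 1, 2],
        "subaddress": [1, 1, 3],
        "rx_or_tx": [1, 1, 0],
        "wordcount": [4, 8, 2],
    }
)

groups = df.groupby(["address", "subaddress", "rx_or_tx"])

# Iterating a GroupBy yields (key, sub-frame) pairs, one per group.
totals = {}
for key, group in groups:
    totals[key] = int(group["wordcount"].sum())
```

Here `key` is a tuple of the three grouping values, so no separate tuple column is needed for plain iteration.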

I am not sure if this is exactly what you want, but let me know and I can edit.

0 Accepted 2015-06-09 20:23:34
