New pandas dataframe from meta information of existing DF

I currently have a CSV file that I read into a dataframe as follows:

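[in]
import pandas as pd

df = pd.read_csv(file_name)
# DataFrame.sort was removed from pandas; sort_values is its replacement
df.sort_values('TOTAL_MONTHS', inplace=True)
print(df[['TOTAL_MONTHS', 'COUNTEM']])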

[out] 
    TOTAL_MONTHS       COUNTEM
    12                 0 
    12                 0 
    12                 2 
    25                 10
    25                 0 
    37                 1
    68                 3

I want to get the total number of rows (by TOTAL_MONTHS) for which the 'COUNTEM' value falls within a preset bin.

The data is going to be entered into a histogram via Excel/PowerPoint with:

X-axis = Number of contracts

Y-axis = Total_months

Color of bar = COUNTEM

The input of the graph is like this (columns being COUNTEM bins):

MONTHS    0    1-3    4-6    7-10    10+    20+
0         0    0      0      0       0      0  
1         0    0      0      0       0      0   
2         0    0      0      0       0      0
3         0    0      0      0       0      0
...
12        2    1      0      0       0      0
...
25        1    0      0      0       1      0
...
37        0    1      0      0       0      0
...
68        0    1      0      0       0      0

Ideally I'd like the code to output a dataframe in that format.

Interesting problem. I don't know pandas properly, so there may well be a fancier and simpler solution to this. However, it can also be done by iterating, in the following manner:

#First, import pandas and create your data
import pandas as pd

DF = pd.DataFrame({'TOTAL_MONTHS'   : [12, 12, 12, 25, 25, 37, 68], 
                   'COUNTEM'        : [0, 0, 2, 10, 0, 1, 3]
                   })

#Next create a data frame of 'bins' with the months as index and all
#values set at a default of zero
New_DF = pd.DataFrame({'bin0'   : 0,
                       'bin1'   : 0,
                       'bin2'   : 0,
                       'bin3'   : 0,
                       'bin4'   : 0,
                       'bin5'   : 0}, 
                       index = DF.TOTAL_MONTHS.unique())

In [59]: New_DF
Out[59]: 
    bin0  bin1  bin2  bin3  bin4  bin5
12     0     0     0     0     0     0
25     0     0     0     0     0     0
37     0     0     0     0     0     0
68     0     0     0     0     0     0

#Create a list of bins (rather than 20 to infinity I limited it to 100)
bins = [[0], range(1, 4), range(4, 7), range(7, 10), range(10, 20), range(20, 100)]

#Now iterate over the months of the New_DF index and slice the original
#DF where TOTAL_MONTHS equals the month of the current iteration. Then
#get a value count from the original data frame and use the position of
#the matching bin to pick the column of New_DF that receives the count:

for month in New_DF.index:
    monthly = DF[DF['TOTAL_MONTHS'] == month]
    counts = monthly['COUNTEM'].value_counts()
    for count in counts.keys():
        for x in range(len(bins)):
            if count in bins[x]:
                #Add rather than assign, so that different COUNTEM values
                #falling into the same bin accumulate (.ix is gone, use .loc)
                New_DF.loc[month, New_DF.columns[x]] += counts[count]

Which gives me:

In [62]: New_DF
Out[62]: 
    bin0  bin1  bin2  bin3  bin4  bin5
12     2     1     0     0     0     0
25     1     0     0     0     1     0
37     0     1     0     0     0     0
68     0     1     0     0     0     0

Which appears to be what you want. You can rename the index and columns as you see fit.
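As a rough sketch of that renaming step, the snippet below rebuilds the result table shown above, adds a zero-filled row for every month without data (the question's target table lists months 0, 1, 2, ...), and then relabels the columns and index with the names from the question. The variable name `full` is only illustrative, and the label list is copied from the question's header row.

```python
import pandas as pd

# Recreate the result table printed above
New_DF = pd.DataFrame(
    [[2, 1, 0, 0, 0, 0],
     [1, 0, 0, 0, 1, 0],
     [0, 1, 0, 0, 0, 0],
     [0, 1, 0, 0, 0, 0]],
    index=[12, 25, 37, 68],
    columns=['bin0', 'bin1', 'bin2', 'bin3', 'bin4', 'bin5'])

# Add a row of zeros for every month from 0 up to the largest month seen
last_month = int(New_DF.index.max())
full = New_DF.reindex(range(last_month + 1), fill_value=0)

# Column labels taken from the header row in the question
full.columns = ['0', '1-3', '4-6', '7-10', '10+', '20+']
full.index.name = 'MONTHS'

print(full.loc[[0, 12, 25, 37, 68]])
```

Calling `full.to_csv(...)` or `full.to_excel(...)` would then give a file that can be pasted into the spreadsheet.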

Hope this helps. Perhaps someone has a solution that uses a built-in pandas function, but for now this seems to work.
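One possible built-in route, offered as a sketch and not as the definitive answer, is to bin COUNTEM with `pd.cut` and count rows per month with `pd.crosstab`. The bin edges below are an assumption chosen to mirror the ranges used in the loop above (0, 1-3, 4-6, 7-9, 10-19, 20 and up), and the names `edges`, `codes` and `table` are illustrative.

```python
import pandas as pd

DF = pd.DataFrame({'TOTAL_MONTHS': [12, 12, 12, 25, 25, 37, 68],
                   'COUNTEM':      [0, 0, 2, 10, 0, 1, 3]})

# Assumed bin edges; right=False makes each interval closed on the left,
# i.e. [0, 1), [1, 4), [4, 7), [7, 10), [10, 20), [20, inf)
edges = [0, 1, 4, 7, 10, 20, float('inf')]
labels = ['bin0', 'bin1', 'bin2', 'bin3', 'bin4', 'bin5']

# labels=False returns the integer position of the bin for each row
codes = pd.cut(DF['COUNTEM'], bins=edges, right=False, labels=False)

# Count rows for every (month, bin position) pair
table = pd.crosstab(DF['TOTAL_MONTHS'], codes)

# Bins that never occur are missing from the crosstab, so add them as zeros
table = table.reindex(columns=range(len(labels)), fill_value=0)
table.columns = labels

print(table)
```

Unlike the loop, this version has no upper limit of 100 on the last bin, because the final edge is infinity.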
