
Using Pandas: how to use column data for statistical analysis on a large dataset

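My dataset covers 48,000 villages. Each village grows 10 to 12 crops, and I have the sown area of every crop in every village. I want to find out which crop covers the largest area in each village, and what percentage of that village's total cropped area goes to crop 1, crop 2, ... crop n. In other words, I want village-wise crop proportions: if village A grows crop 1 and crop 2, what percentage of A's area is crop 1 and what percentage is crop 2?

After that, I can rank the villages for a particular crop, and see which crop is sown on the largest area in which village.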

  District   Taluka            Village Name       Crop        Area in hec
0   Ahmednagar  Pathardi          Alhanwadi   Bajara        370.0
1   Ahmednagar  Pathardi             Adgaon   Bajara        302.0
2   Ahmednagar  Pathardi       Ambika Nagar   Bajara         40.0
3   Ahmednagar  Pathardi         Bharajwadi   Bajara         90.0
4   Ahmednagar  Pathardi           Bhalgaon   Bajara        254.0
5   Ahmednagar  Pathardi  Bhawarwadi (N.V.)   Bajara         35.0
6   Ahmednagar  Pathardi           Badewadi   Bajara         17.0
7   Ahmednagar  Pathardi              Akola   Bajara        175.0
8   Ahmednagar  Pathardi          Auranjpur   Bajara         35.0
9   Ahmednagar  Pathardi          Agaskhand   Bajara        100.0
10  Ahmednagar  Pathardi          Alhanwadi   Cotton        150.0
11  Ahmednagar  Pathardi             Adgaon   Cotton        310.0
12  Ahmednagar  Pathardi       Ambika Nagar   Cotton        131.0
13  Ahmednagar  Pathardi         Bharajwadi   Cotton        161.0
14  Ahmednagar  Pathardi           Bhalgaon   Cotton        562.0
15  Ahmednagar  Pathardi  Bhawarwadi (N.V.)   Cotton        211.0
16  Ahmednagar  Pathardi           Badewadi   Cotton        104.0
17  Ahmednagar  Pathardi              Akola   Cotton        550.0
18  Ahmednagar  Pathardi          Auranjpur   Cotton          0.0
19  Ahmednagar  Pathardi          Agaskhand   Cotton          0.0
20  Ahmednagar  Pathardi          Alhanwadi  Soybean         26.0
21  Ahmednagar  Pathardi             Adgaon  Soybean         52.0
22  Ahmednagar  Pathardi       Ambika Nagar  Soybean         72.0
23  Ahmednagar  Pathardi         Bharajwadi  Soybean         88.0
24  Ahmednagar  Pathardi           Bhalgaon  Soybean         90.0
25  Ahmednagar  Pathardi  Bhawarwadi (N.V.)  Soybean         93.0
26  Ahmednagar  Pathardi           Badewadi  Soybean        100.0
27  Ahmednagar  Pathardi              Akola  Soybean         10.0
28  Ahmednagar  Pathardi          Auranjpur  Soybean         45.0
29  Ahmednagar  Pathardi          Agaskhand  Soybean         20.0
30  Ahmednagar  Pathardi          Alhanwadi    Maize         10.0
31  Ahmednagar  Pathardi             Adgaon    Maize          1.5
32  Ahmednagar  Pathardi       Ambika Nagar    Maize          3.0
33  Ahmednagar  Pathardi         Bharajwadi    Maize          5.0
34  Ahmednagar  Pathardi           Bhalgaon    Maize         12.0
35  Ahmednagar  Pathardi  Bhawarwadi (N.V.)    Maize         51.0
36  Ahmednagar  Pathardi           Badewadi    Maize          5.0
37  Ahmednagar  Pathardi              Akola    Maize         25.0
38  Ahmednagar  Pathardi          Auranjpur    Maize          5.0
39  Ahmednagar  Pathardi          Agaskhand    Maize         10.0

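import pandas as pd
import numpy as np

# Load the sheet into a DataFrame.
D = pd.read_excel("/media/desktop/Sample-2.xlsx", sheet_name="Sheet1")

village = D["Village Name"].unique()
crop = D["Crop"].unique()

# Collect one sub-frame per (village, crop) pair.
q1 = []
for i in village:
    for j in crop:
        a = D["Village Name"] == i
        b = D["Crop"] == j
        D1 = D[a & b]
        q1.append(D1)

# Keep only the non-empty sub-frames.
q2 = []
for i in q1:
    if not i.empty:
        q2.append(i)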

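Now we have the village-wise sown area of each crop in hectares. Next we need to calculate, for village A, the % of crop-1, the % of crop-2, ..., the % of crop-n.

Formula: for village A, the share of Crop-1 is (area of Crop-1) / (area of all crops in that village), which gives Crop-1's % for that village. Crop-2's % is found the same way.

The same applies to every village.

Any suggestions?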

First, the top crop in each village (df here is the DataFrame that the question calls D):

df1 = df.sort_values(['Village Name','Area in hec'], ascending=[True, False])

df2 = df1.drop_duplicates('Village Name')
print (df2)
      District    Taluka       Village Name     Crop  Area in hec
11  Ahmednagar  Pathardi             Adgaon   Cotton        310.0
9   Ahmednagar  Pathardi          Agaskhand   Bajara        100.0
17  Ahmednagar  Pathardi              Akola   Cotton        550.0
0   Ahmednagar  Pathardi          Alhanwadi   Bajara        370.0
12  Ahmednagar  Pathardi       Ambika Nagar   Cotton        131.0
28  Ahmednagar  Pathardi          Auranjpur  Soybean         45.0
16  Ahmednagar  Pathardi           Badewadi   Cotton        104.0
14  Ahmednagar  Pathardi           Bhalgaon   Cotton        562.0
13  Ahmednagar  Pathardi         Bharajwadi   Cotton        161.0
15  Ahmednagar  Pathardi  Bhawarwadi (N.V.)   Cotton        211.0
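
As an alternative to sorting and dropping duplicates, the same "largest row per village" selection can be sketched with groupby and idxmax. The small frame below is an assumption made for illustration; it reuses a few rows from the question's sample and keeps its column names.

```python
import pandas as pd

# Illustrative subset of the question's sample data (same column names).
df = pd.DataFrame({
    "Village Name": ["Adgaon", "Adgaon", "Akola", "Akola", "Alhanwadi", "Alhanwadi"],
    "Crop": ["Bajara", "Cotton", "Bajara", "Cotton", "Bajara", "Cotton"],
    "Area in hec": [302.0, 310.0, 175.0, 550.0, 370.0, 150.0],
})

# For each village, idxmax gives the index label of the row with the largest area.
top_idx = df.groupby("Village Name")["Area in hec"].idxmax()

# Look those rows up again to recover the crop name alongside the area.
top_crop = df.loc[top_idx, ["Village Name", "Crop", "Area in hec"]]
print(top_crop)
```

When two crops tie for the largest area in a village, idxmax keeps the first one it meets, so the result depends on row order in the same way the drop_duplicates version does.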

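And the percentage of area per crop, that is, each row's share of the total area sown with that crop across all villages: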

s = df1.groupby("Crop")['Area in hec'].transform('sum')
df1['perc'] =  df1['Area in hec'].div(s).mul(100)
print (df1.head(10))
      District    Taluka Village Name     Crop  Area in hec       perc
11  Ahmednagar  Pathardi       Adgaon   Cotton        310.0  14.226709
1   Ahmednagar  Pathardi       Adgaon   Bajara        302.0  21.297602
21  Ahmednagar  Pathardi       Adgaon  Soybean         52.0   8.724832
31  Ahmednagar  Pathardi       Adgaon    Maize          1.5   1.176471
9   Ahmednagar  Pathardi    Agaskhand   Bajara        100.0   7.052186
29  Ahmednagar  Pathardi    Agaskhand  Soybean         20.0   3.355705
39  Ahmednagar  Pathardi    Agaskhand    Maize         10.0   7.843137
19  Ahmednagar  Pathardi    Agaskhand   Cotton          0.0   0.000000
17  Ahmednagar  Pathardi        Akola   Cotton        550.0  25.240936
7   Ahmednagar  Pathardi        Akola   Bajara        175.0  12.341326
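
The formula in the question divides by the village total rather than the crop total. A minimal sketch of that variant, assuming the same column names and using two villages from the sample, only changes the groupby key. The column name perc_in_village is made up for this example.

```python
import pandas as pd

# Two villages taken from the question's sample data.
df = pd.DataFrame({
    "Village Name": ["Adgaon"] * 4 + ["Agaskhand"] * 4,
    "Crop": ["Bajara", "Cotton", "Soybean", "Maize"] * 2,
    "Area in hec": [302.0, 310.0, 52.0, 1.5, 100.0, 0.0, 20.0, 10.0],
})

# Total cropped area of each village, repeated on every row of that village.
village_total = df.groupby("Village Name")["Area in hec"].transform("sum")

# Share of each crop within its own village, in percent.
df["perc_in_village"] = df["Area in hec"].div(village_total).mul(100)
print(df)
```

Within each village the new column sums to 100. A village whose total area is 0 would produce NaN here (0 divided by 0), so such rows may need separate handling.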

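First, use groupby to sum the area of all crops in each village into one total per village:

total_lands = D.groupby("Village Name")['Area in hec'].sum()

Then group by village and crop to get the total for each crop in each village:

lands_by_crop = D.groupby(["Village Name", "Crop"])['Area in hec'].sum()

Finally, calculate the percentages by dividing each crop total by its village's total, aligned on the "Village Name" index level:

percentages = lands_by_crop.div(total_lands, level="Village Name").mul(100)

I think this should work. There may be more efficient ways to solve it; I am not sure.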
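
The village-by-crop percentages can also be laid out as a wide table, one row per village and one column per crop, which may be easier to read. The sketch below is one possible way to do it with pivot_table. The data is a small illustrative subset of the sample, and the names wide and share are assumptions for this example.

```python
import pandas as pd

# Illustrative subset: not every village grows every crop.
df = pd.DataFrame({
    "Village Name": ["Adgaon", "Adgaon", "Agaskhand", "Agaskhand"],
    "Crop": ["Bajara", "Cotton", "Bajara", "Soybean"],
    "Area in hec": [302.0, 310.0, 100.0, 20.0],
})

# Rows are villages, columns are crops; missing combinations become 0.
wide = df.pivot_table(index="Village Name", columns="Crop",
                      values="Area in hec", aggfunc="sum", fill_value=0)

# Divide every row by its own total so each village's row sums to 100.
share = wide.div(wide.sum(axis=1), axis=0).mul(100)
print(share.round(2))
```

With 48,000 villages and 10 to 12 crops the wide table stays small (one column per distinct crop), so this layout should remain practical at that size.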

To get the area of each crop in each village, use the following:

D.filter(items=["Village Name", "Crop", "Area in hec"], axis=1).groupby(by=["Village Name", "Crop"]).sum()

Then you can divide by the total area of each crop (D.filter(items=["Crop", "Area in hec"], axis=1).groupby(by="Crop").sum()) or by the total area of each village (D.filter(items=["Village Name", "Area in hec"], axis=1).groupby(by="Village Name").sum()) to get the proportions.
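
Once a proportion column exists, the ranking step mentioned in the question could be sketched as below. The helper rank_villages and the column share_in_village are hypothetical names introduced for this example, and the shares are computed only from the two crops included in this illustrative subset.

```python
import pandas as pd

# Illustrative subset of the sample: two crops in three villages.
df = pd.DataFrame({
    "Village Name": ["Adgaon", "Akola", "Bhalgaon", "Adgaon", "Akola", "Bhalgaon"],
    "Crop": ["Cotton"] * 3 + ["Bajara"] * 3,
    "Area in hec": [310.0, 550.0, 562.0, 302.0, 175.0, 254.0],
})

# Share of each crop within its own village, in percent.
village_total = df.groupby("Village Name")["Area in hec"].transform("sum")
df["share_in_village"] = df["Area in hec"] / village_total * 100


def rank_villages(frame, crop, n=2):
    """Return the n villages where `crop` takes the largest share of village area."""
    subset = frame[frame["Crop"] == crop]
    cols = ["Village Name", "Area in hec", "share_in_village"]
    return subset.nlargest(n, "share_in_village")[cols]


top_cotton = rank_villages(df, "Cotton")
print(top_cotton)
```

Ranking by share and ranking by raw area can disagree: in this subset Bhalgaon has the largest cotton area, but Akola devotes the largest share of its land to cotton. Which column to rank on depends on the question being asked.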

