Using Pandas: how to use column data for statistical analysis on big data
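My dataset covers 48,000 villages. Each village grows 10 to 12 crops, and I have the sown area of each crop in each village. I want to find which crop occupies the largest area in each village, and what percentage of the village's total cropped area each crop (crop 1 through crop n) accounts for. In other words, I want the village-wise crop proportions: if village A grows crop 1 and crop 2, what percentage of A's area is crop 1 and what percentage is crop 2?
With that, I can then rank the villages for a particular crop and see which crop is sown over the largest area in which village.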
District Taluka Village Name Crop Area in hec
0 Ahmednagar Pathardi Alhanwadi Bajara 370.0
1 Ahmednagar Pathardi Adgaon Bajara 302.0
2 Ahmednagar Pathardi Ambika Nagar Bajara 40.0
3 Ahmednagar Pathardi Bharajwadi Bajara 90.0
4 Ahmednagar Pathardi Bhalgaon Bajara 254.0
5 Ahmednagar Pathardi Bhawarwadi (N.V.) Bajara 35.0
6 Ahmednagar Pathardi Badewadi Bajara 17.0
7 Ahmednagar Pathardi Akola Bajara 175.0
8 Ahmednagar Pathardi Auranjpur Bajara 35.0
9 Ahmednagar Pathardi Agaskhand Bajara 100.0
10 Ahmednagar Pathardi Alhanwadi Cotton 150.0
11 Ahmednagar Pathardi Adgaon Cotton 310.0
12 Ahmednagar Pathardi Ambika Nagar Cotton 131.0
13 Ahmednagar Pathardi Bharajwadi Cotton 161.0
14 Ahmednagar Pathardi Bhalgaon Cotton 562.0
15 Ahmednagar Pathardi Bhawarwadi (N.V.) Cotton 211.0
16 Ahmednagar Pathardi Badewadi Cotton 104.0
17 Ahmednagar Pathardi Akola Cotton 550.0
18 Ahmednagar Pathardi Auranjpur Cotton 0.0
19 Ahmednagar Pathardi Agaskhand Cotton 0.0
20 Ahmednagar Pathardi Alhanwadi Soybean 26.0
21 Ahmednagar Pathardi Adgaon Soybean 52.0
22 Ahmednagar Pathardi Ambika Nagar Soybean 72.0
23 Ahmednagar Pathardi Bharajwadi Soybean 88.0
24 Ahmednagar Pathardi Bhalgaon Soybean 90.0
25 Ahmednagar Pathardi Bhawarwadi (N.V.) Soybean 93.0
26 Ahmednagar Pathardi Badewadi Soybean 100.0
27 Ahmednagar Pathardi Akola Soybean 10.0
28 Ahmednagar Pathardi Auranjpur Soybean 45.0
29 Ahmednagar Pathardi Agaskhand Soybean 20.0
30 Ahmednagar Pathardi Alhanwadi Maize 10.0
31 Ahmednagar Pathardi Adgaon Maize 1.5
32 Ahmednagar Pathardi Ambika Nagar Maize 3.0
33 Ahmednagar Pathardi Bharajwadi Maize 5.0
34 Ahmednagar Pathardi Bhalgaon Maize 12.0
35 Ahmednagar Pathardi Bhawarwadi (N.V.) Maize 51.0
36 Ahmednagar Pathardi Badewadi Maize 5.0
37 Ahmednagar Pathardi Akola Maize 25.0
38 Ahmednagar Pathardi Auranjpur Maize 5.0
39 Ahmednagar Pathardi Agaskhand Maize 10.0
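import pandas as pd
import numpy as np

# Load the sheet that holds the village/crop/area records
D = pd.read_excel("/media/desktop/Sample-2.xlsx", sheet_name="Sheet1")
village = D["Village Name"].unique()
crop = D["Crop"].unique()

# Collect one sub-frame per (village, crop) pair
q1 = []
for i in village:
    for j in crop:
        a = D["Village Name"] == i
        b = D["Crop"] == j
        q1.append(D[a & b])

# Keep only the non-empty sub-frames
q2 = [sub for sub in q1 if not sub.empty]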
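At this point we have the village-wise sown area of each crop in hectares. Next we need to calculate, for village A, the percentage of crop 1, the percentage of crop 2, ... and the percentage of crop n.
Formula: for crop 1 in village A, divide the area of crop 1 by the total area of all crops in that village; this gives the crop 1 percentage for that village. The crop 2 percentage is found the same way.
The same applies to every village.
Any suggestions?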
First, the top crop in each village:
df1 = df.sort_values(['Village Name','Area in hec'], ascending=[True, False])
df2 = df1.drop_duplicates('Village Name')
print (df2)
District Taluka Village Name Crop Area in hec
11 Ahmednagar Pathardi Adgaon Cotton 310.0
9 Ahmednagar Pathardi Agaskhand Bajara 100.0
17 Ahmednagar Pathardi Akola Cotton 550.0
0 Ahmednagar Pathardi Alhanwadi Bajara 370.0
12 Ahmednagar Pathardi Ambika Nagar Cotton 131.0
28 Ahmednagar Pathardi Auranjpur Soybean 45.0
16 Ahmednagar Pathardi Badewadi Cotton 104.0
14 Ahmednagar Pathardi Bhalgaon Cotton 562.0
13 Ahmednagar Pathardi Bharajwadi Cotton 161.0
15 Ahmednagar Pathardi Bhawarwadi (N.V.) Cotton 211.0
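The same "top crop per village" result can probably also be reached without sorting the whole frame. The following is only a sketch on a small hand-made sample (the rows are copied from the question's data, but the frame itself and the names top_idx and top_crop are assumptions for illustration). It uses groupby with idxmax, which picks the row label of the largest area in each village:

```python
import pandas as pd

# Small sample with the same column names as the question's data.
df = pd.DataFrame({
    "Village Name": ["Adgaon", "Adgaon", "Akola", "Akola", "Auranjpur", "Auranjpur"],
    "Crop": ["Bajara", "Cotton", "Bajara", "Cotton", "Cotton", "Soybean"],
    "Area in hec": [302.0, 310.0, 175.0, 550.0, 0.0, 45.0],
})

# idxmax returns, per village, the row label of the largest area;
# .loc then pulls those rows out of the original frame.
top_idx = df.groupby("Village Name")["Area in hec"].idxmax()
top_crop = df.loc[top_idx].reset_index(drop=True)
print(top_crop)
```

If two crops tie for the largest area in a village, idxmax keeps only the first one it meets, so ties would need separate handling.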
And each row's area as a percentage of that crop's total area across all villages:
s = df1.groupby("Crop")['Area in hec'].transform('sum')
df1['perc'] = df1['Area in hec'].div(s).mul(100)
print (df1.head(10))
District Taluka Village Name Crop Area in hec perc
11 Ahmednagar Pathardi Adgaon Cotton 310.0 14.226709
1 Ahmednagar Pathardi Adgaon Bajara 302.0 21.297602
21 Ahmednagar Pathardi Adgaon Soybean 52.0 8.724832
31 Ahmednagar Pathardi Adgaon Maize 1.5 1.176471
9 Ahmednagar Pathardi Agaskhand Bajara 100.0 7.052186
29 Ahmednagar Pathardi Agaskhand Soybean 20.0 3.355705
39 Ahmednagar Pathardi Agaskhand Maize 10.0 7.843137
19 Ahmednagar Pathardi Agaskhand Cotton 0.0 0.000000
17 Ahmednagar Pathardi Akola Cotton 550.0 25.240936
7 Ahmednagar Pathardi Akola Bajara 175.0 12.341326
First use groupby to sum the area of each village into one total:
total_lands = D.groupby("Village Name")['Area in hec'].sum()
Then group by village and crop to get the total for each crop in each village:
lands_by_crop = D.groupby(["Village Name", "Crop"])['Area in hec'].sum()
Finally, calculate the percentages:
percentages = lands_by_crop.div(total_lands, level="Village Name").mul(100)
I think this should work (I was not entirely sure about the last step), and there may be a more efficient way to solve it.
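One possible alternative layout is a wide table with one row per village and one column per crop, which makes the shares easy to read side by side. This is a sketch only, on two villages from the question's sample; the names wide, share and dominant are assumptions:

```python
import pandas as pd

# Two villages taken from the question's sample data.
D = pd.DataFrame({
    "Village Name": ["Adgaon"] * 4 + ["Agaskhand"] * 4,
    "Crop": ["Bajara", "Cotton", "Soybean", "Maize"] * 2,
    "Area in hec": [302.0, 310.0, 52.0, 1.5, 100.0, 0.0, 20.0, 10.0],
})

# One row per village, one column per crop; crops a village lacks become 0.
wide = D.pivot_table(index="Village Name", columns="Crop",
                     values="Area in hec", aggfunc="sum", fill_value=0)

# Divide every row by its own total to get the percentage share per crop.
share = wide.div(wide.sum(axis=1), axis=0).mul(100)

# The crop with the largest share in each village.
dominant = share.idxmax(axis=1)
print(share.round(2))
print(dominant)
```

With 48,000 villages and 10 to 12 crops the wide table stays small (roughly 48,000 rows by a dozen columns), so this should fit comfortably in memory.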
To get the crop area per village, use:
D.filter(items=["Village Name", "Crop", "Area in hec"], axis=1).groupby(by=["Village Name", "Crop"]).sum()
You can then divide by the total area per crop (D.filter(items=["Crop", "Area in hec"], axis=1).groupby(by="Crop").sum()) or by the total area per village (D.filter(items=["Village Name", "Area in hec"], axis=1).groupby(by="Village Name").sum()) to get the proportions.
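For the ranking step mentioned in the question (ranking villages for a particular crop), a grouped rank may be enough. The sketch below ranks villages by sown area within each crop, on a few rows copied from the question's sample; the column name rank_in_crop and the choice of dense ranking are assumptions:

```python
import pandas as pd

# Three villages and two crops taken from the question's sample data.
D = pd.DataFrame({
    "Village Name": ["Adgaon", "Akola", "Bhalgaon", "Adgaon", "Akola", "Bhalgaon"],
    "Crop": ["Cotton"] * 3 + ["Bajara"] * 3,
    "Area in hec": [310.0, 550.0, 562.0, 302.0, 175.0, 254.0],
})

# Rank villages inside each crop; rank 1 is the largest sown area.
D["rank_in_crop"] = (
    D.groupby("Crop")["Area in hec"]
     .rank(method="dense", ascending=False)
     .astype(int)
)
ranked = D.sort_values(["Crop", "rank_in_crop"]).reset_index(drop=True)
print(ranked)
```

The same call could be applied to a percentage column instead of the raw area if villages should be ranked by share rather than by hectares.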