First off, I realize that this question has been asked a ton of times in many different forms, but a lot of the answers just give code that solves the problem without explaining what the code actually does or why it works.
I have an enormous data set of phone numbers and area codes that I have loaded into a dataframe in Python for processing. Before I do that processing, I need to split the single dataframe into multiple dataframes, each containing the phone numbers whose area codes fall in a certain range, so that I can process each one further. For example:
+---+--------------+-----------+
| | phone_number | area_code |
+---+--------------+-----------+
| 1 | 5501231234 | 550 |
+---+--------------+-----------+
| 2 | 5051231234 | 505 |
+---+--------------+-----------+
| 3 | 5001231234 | 500 |
+---+--------------+-----------+
| 4 | 6201231234 | 620 |
+---+--------------+-----------+
into
area-codes (500-550)
+---+--------------+-----------+
| | phone_number | area_code |
+---+--------------+-----------+
| 1 | 5501231234 | 550 |
+---+--------------+-----------+
| 2 | 5051231234 | 505 |
+---+--------------+-----------+
| 3 | 5001231234 | 500 |
+---+--------------+-----------+
and
area-codes (600-650)
+---+--------------+-----------+
| | phone_number | area_code |
+---+--------------+-----------+
| 1 | 6201231234 | 620 |
+---+--------------+-----------+
I get that this should be possible using pandas (specifically groupby and a Series object I think) but the documentation and examples on the internet I could find were a little too nebulous or sparse for me to follow. Maybe there's a better way to do this than the way I'm trying to do it?
You can use pd.cut to bin the area_code column, then use the bin labels to group the data and store each group in a dictionary. Finally, print each key to see the corresponding dataframe:
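import pandas as pd

# Bin edges give the intervals [500, 550], (550, 600], (600, 650]; one label per bin.
bins = [500, 550, 600, 650]
labels = ['500-550', '550-600', '600-650']
# pd.cut assigns each area code the label of its bin; groupby then yields one
# sub-dataframe per label. observed=True skips bins that contain no rows.
binned = pd.cut(df.area_code, bins, include_lowest=True, labels=labels)
d = {f'area_code_{i}': g for i, g in df.groupby(binned, observed=True)}
print(d['area_code_500-550'])
print('\n')
print(d['area_code_600-650'])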
   phone_number  area_code
0    5501231234        550
1    5051231234        505
2    5001231234        500

   phone_number  area_code
3    6201231234        620
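To see why this works, it may help to look at the intermediate values. The sketch below rebuilds the sample data from the question and splits the one-liner into two steps; the variable names binned and groups are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "phone_number": [5501231234, 5051231234, 5001231234, 6201231234],
    "area_code": [550, 505, 500, 620],
})

bins = [500, 550, 600, 650]
labels = ["500-550", "550-600", "600-650"]

# Step 1: pd.cut maps each area code to the label of the bin it falls in.
# Bins are closed on the right: (500, 550], (550, 600], (600, 650].
# include_lowest=True also admits the left edge 500 into the first bin.
binned = pd.cut(df["area_code"], bins, include_lowest=True, labels=labels)
print(binned.tolist())  # ['500-550', '500-550', '500-550', '600-650']

# Step 2: groupby accepts that Series as the grouping key and yields
# (label, sub-dataframe) pairs; observed=True skips bins with no rows.
groups = {label: group for label, group in df.groupby(binned, observed=True)}
print(list(groups))  # ['500-550', '600-650']
```

Area codes outside every bin (for example 700) would come back from pd.cut as NaN and would not appear in any group, so the bin edges need to cover every range you care about.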
You can also do this by selecting rows of the dataframe with multiple conditions chained by the & or | operator.
df1 selects the rows with area_code between 500 and 550 (inclusive).
df2 selects the rows with area_code between 600 and 650 (inclusive).
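import pandas as pd

df = pd.DataFrame({'phone_number': [5501231234, 5051231234, 5001231234, 6201231234],
                   'area_code': [550, 505, 500, 620]},
                  columns=['phone_number', 'area_code'])
# Each comparison yields a boolean Series; & keeps rows where both are True.
# The parentheses are required because & binds tighter than >= and <=.
df1 = df[(df['area_code'] >= 500) & (df['area_code'] <= 550)]
df2 = df[(df['area_code'] >= 600) & (df['area_code'] <= 650)]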
df1
   phone_number  area_code
0    5501231234        550
1    5051231234        505
2    5001231234        500
df2
   phone_number  area_code
3    6201231234        620
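If there are many ranges, writing one mask per dataframe gets repetitive. As a rough sketch, the same selection can be driven from a lookup of ranges using Series.between, which is inclusive on both ends by default; the ranges dictionary and the name split are hypothetical and only for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "phone_number": [5501231234, 5051231234, 5001231234, 6201231234],
    "area_code": [550, 505, 500, 620],
})

# Hypothetical range table: label -> (low, high), both bounds inclusive.
ranges = {"500-550": (500, 550), "600-650": (600, 650)}

# between(low, high) builds the same boolean mask as
# (area_code >= low) & (area_code <= high).
split = {
    name: df[df["area_code"].between(low, high)]
    for name, (low, high) in ranges.items()
}

print(split["500-550"])
print(split["600-650"])
```

Unlike the pd.cut approach, this scans the column once per range, and overlapping ranges would place the same row in more than one dataframe.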