
How to obtain an origin-destination matrix from JSON in Python?

I have a JSON file like this:

[ {"fecha" : "2013-07-01","indicativo" : "3195","nombre" : "MADRID,RETIRO","orig" : "10","dest" : "122","value" : "15"},{"fecha" :"2013-07-02","indicativo" : "3195","nombre" : "MADRID, RETIRO","orig" :"15","dest" : "5","value" : "15"},{"fecha" : "2013-07-03","indicativo" :"3195","nombre" : "MADRID, RETIRO","orig" : "5","dest" : "15","value" :"15"},{"fecha" : "2013-07-04","indicativo" : "3195","nombre" : "MADRID,RETIRO","orig" : "10","dest" : "122","value" : "15"}]

What I'm trying to obtain is a matrix that has the orig field values as rows and the dest field values as columns. Each cell should hold the number of records that have that orig and dest.

Example with the provided data:

| data | 5 | 10 | 15 | 122 |
|------|---|----|----|-----|
| 5    | 0 | 0  | 1  | 0   |
| 10   | 0 | 0  | 0  | 2   |
| 15   | 1 | 0  | 0  | 0   |
| 122  | 0 | 0  | 0  | 0   |

Basically, I want a table that shows me, for example, that for orig = 10 and dest = 122 there are 2 occurrences in the JSON.

I understand that I first need to parse the JSON and turn it into a DataFrame.

The problem is, once I have this df, how can I create a matrix with as many rows and columns as there are distinct orig and dest values? (Note that they are like base IDs: if 122 appears in dest but not in orig, it means nobody travelled from that point, but some people arrived at it.)

I imagine I would first need to extract the distinct IDs found in orig and dest, then go through each row of the JSON and increment the df[orig][dest] position by one. But is there a more efficient and quicker solution for this?
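For reference, the manual approach described above could be sketched roughly as follows. This is only an illustration in plain Python: the records are trimmed to the orig and dest fields, and the names ids, pair_counts and matrix are made up for this example.

```python
from collections import Counter

# Sample records from the question, trimmed to the two relevant fields.
data = [
    {"orig": "10", "dest": "122"},
    {"orig": "15", "dest": "5"},
    {"orig": "5", "dest": "15"},
    {"orig": "10", "dest": "122"},
]

# Step 1: collect every distinct ID seen in either orig or dest.
ids = sorted({int(r["orig"]) for r in data} | {int(r["dest"]) for r in data})

# Step 2: count how often each (orig, dest) pair occurs.
pair_counts = Counter((int(r["orig"]), int(r["dest"])) for r in data)

# Step 3: build the square matrix as a list of rows; missing pairs count as 0.
matrix = [[pair_counts[(o, d)] for d in ids] for o in ids]

for o, row in zip(ids, matrix):
    print(o, row)
```

This works, but it re-implements counting and alignment that pandas can do in a single expression.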

Say you have loaded your JSON file into a list of dicts named data:
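If the file has not been loaded yet, a minimal sketch of that step could look like this. The inline string stands in for the file contents (trimmed to a few fields), and the file name in the comment is an assumption, not something given in the question.

```python
import json

# Inline stand-in for the JSON file from the question (fields trimmed).
raw = """[
  {"fecha": "2013-07-01", "orig": "10", "dest": "122", "value": "15"},
  {"fecha": "2013-07-02", "orig": "15", "dest": "5", "value": "15"},
  {"fecha": "2013-07-03", "orig": "5", "dest": "15", "value": "15"},
  {"fecha": "2013-07-04", "orig": "10", "dest": "122", "value": "15"}
]"""

# json.loads parses a string; the result is a list of dicts.
data = json.loads(raw)

# For a file on disk (hypothetical path), json.load does the same:
# with open("trips.json", encoding="utf-8") as f:
#     data = json.load(f)

print(len(data), data[0]["orig"], data[0]["dest"])
```

Note that every value, including orig and dest, arrives as a string here, because that is how they are written in the JSON.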

import pandas as pd

df = pd.DataFrame(data)

df.groupby(['orig', 'dest']).size().unstack().fillna(0).astype(int)

This makes groups of all the unique orig, dest pairs and gets the size of each group (in other words, how many rows have those two unique values of orig and dest), which will form one value in the final dataframe.
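To make the grouping step concrete, here is a small sketch that applies it to the records from the question (trimmed to orig and dest). The variable name counts is just for illustration.

```python
import pandas as pd

# Sample records from the question, trimmed to the two relevant fields.
data = [
    {"orig": "10", "dest": "122"},
    {"orig": "15", "dest": "5"},
    {"orig": "5", "dest": "15"},
    {"orig": "10", "dest": "122"},
]
df = pd.DataFrame(data)

# One entry per distinct (orig, dest) pair, holding the number of rows in it.
counts = df.groupby(["orig", "dest"]).size()
print(counts)
```

The result is a Series indexed by (orig, dest), so only the three pairs that actually occur appear in it; the pair ("10", "122") has a count of 2.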

With unstack, we can convert one level of the index to column names such that the unique values of orig are in the index and those for dest are in the columns.

Finally, we fill the null values (representing pairs which did not exist) with 0 and cast the dataframe back to int for presentability.
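One caveat with the question's data: the result only has rows for IDs that occur in orig and columns for IDs that occur in dest, and string IDs sort as text ("10" before "5"). A possible extension, sketched below under the assumption that the IDs are always integers, converts them to int and reindexes both axes on the union of all IDs to get the square matrix the question asks for. The names ids and matrix are illustrative.

```python
import pandas as pd

# Sample records from the question, trimmed to the two relevant fields.
data = [
    {"orig": "10", "dest": "122"},
    {"orig": "15", "dest": "5"},
    {"orig": "5", "dest": "15"},
    {"orig": "10", "dest": "122"},
]

# Convert the IDs to integers so they sort numerically (assumes integer IDs).
df = pd.DataFrame(data).astype({"orig": int, "dest": int})

# Every ID that appears as an origin or as a destination.
ids = sorted(set(df["orig"]) | set(df["dest"]))

matrix = (
    df.groupby(["orig", "dest"]).size()
      .unstack(fill_value=0)                           # pairs never seen become 0
      .reindex(index=ids, columns=ids, fill_value=0)   # add missing rows/columns
)
print(matrix)
```

With this, ID 122 gets an all-zero row even though nobody departs from it, and ID 10 gets an all-zero column even though nobody arrives there.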

Testing with randomly generated data:

import numpy as np
import pandas as pd

orig_data = np.random.choice(['a', 'b', 'c', 'd', 'e'], 100, p=[0.35, 0.30, 0.20, 0.10, 0.05])
dest_data = np.random.choice(['a', 'b', 'c', 'd', 'e'], 100, p=[0.20, 0.25, 0.25, 0.20, 0.10])

data = [{'orig': orig, 'dest': dest} for orig, dest in zip(orig_data, dest_data)]

df = pd.DataFrame(data)

df.groupby(['orig', 'dest']).size().unstack().fillna(0).astype(int)

Output (the exact counts vary between runs, since the data is random):

dest   a  b  c  d  e
orig                
a      4  9  8  4  3
b     11  8  5  4  6
c      5  2  3  4  5
d      4  3  3  1  3
e      1  0  3  1  0
