
Python Data Analysis from SQL Query

I'm about to start some Python data analysis unlike anything I've done before. I'm currently studying numpy, but so far it hasn't given me insight into how to do this.

I'm using Python 2.7.14 (Anaconda) with cx_Oracle to query complex records.

Each record will be a unique individual with three columns: Employee ID, Relationship Tuples (a Relationship Type Code paired with a Department number; there may be multiple), and Account Flags (flag strings; there may be multiple).

So one record might be:

 [(123456), (135:2345678, 212:4354670, 198:9876545), (Flag1, Flag2, Flag3)]

I need to develop a python script that will take these records and create various counts.

The example record above would be counted in at least 9 different counts:
How many with relationship: 135
How many with relationship: 212
How many with relationship: 198
How many in Department: 2345678
How many in Department: 4354670
How many in Department: 9876545
How many with Flag: Flag1
How many with Flag: Flag2
How many with Flag: Flag3

The other tricky part is that I can't pre-define the relationship codes, departments, or flags. What I'm counting has to be determined by the data retrieved from the query.

Once I understand how to do that, hopefully the next step, getting how many with relationship X also have flag Y, etc., will be intuitive.
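The closest I can picture is something like the following (an untested sketch; it assumes the rows from cx_Oracle have already been parsed into an id, a list of (type, department) pairs, and a list of flag strings):

from collections import Counter

relationship_counts = Counter()
department_counts = Counter()
flag_counts = Counter()
cross_counts = Counter()  # (relationship type, flag) pairs

# parsed_rows stands in for whatever the cx_Oracle cursor returns
parsed_rows = [
    (123456,
     [(135, 2345678), (212, 4354670), (198, 9876545)],
     ['Flag1', 'Flag2', 'Flag3']),
]

for emp_id, relationships, flags in parsed_rows:
    for rel_type, dept in relationships:
        relationship_counts[rel_type] += 1
        department_counts[dept] += 1
        for flag in flags:
            cross_counts[(rel_type, flag)] += 1
    for flag in flags:
        flag_counts[flag] += 1

print(relationship_counts)  # e.g. Counter({135: 1, 212: 1, 198: 1})

Is something like this reasonable, or is there a more idiomatic numpy/pandas way?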

I know this is a lot to ask about, but if someone could just point me in the right direction so I can research or try some tutorials, that would be very helpful. Thank you!

First you need to structure this data to enable a good analysis; you can do that in your database engine or in Python (I will do it the latter way, using pandas as SNygard suggested).

First, I create some fake data (based on what you provided):

import pandas as pd

data = [[12346, '(135:2345678, 212:4354670, 198:9876545)', '(Flag1, Flag2, Flag3)'],
[12345, '(136:2343678, 212:4354670, 198:9876541, 199:9876535)', '(Flag1, Flag4)']]

df = pd.DataFrame(data,columns=['id','relationships','flags'])
df = df.set_index('id')
df

This returns a dataframe like this:
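(Shown as text; pandas may truncate the long strings depending on your display settings.)

                                          relationships                  flags
id
12346          (135:2345678, 212:4354670, 198:9876545)  (Flag1, Flag2, Flag3)
12345  (136:2343678, 212:4354670, 198:9876541, 199:...         (Flag1, Flag4)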

In order to summarize or count by column, we need to improve our data structure so that we can apply group-by operations over departments, relationship types, or flags.

We will convert the relationships and flags columns from strings to Python lists of strings: the flags column will hold a list of flags, and the relationships column a list of relations.

# strip the surrounding parentheses, then split on ', ' so the
# items come out without leading spaces
df['relationships'] = df['relationships'].str.strip('()').str.split(', ')
df['flags'] = df['flags'].str.strip('()').str.split(', ')
df

The result is:
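(Again rendered roughly as text; both columns now hold Python lists.)

                                           relationships                  flags
id
12346           [135:2345678, 212:4354670, 198:9876545]  [Flag1, Flag2, Flag3]
12345  [136:2343678, 212:4354670, 198:9876541, 199:...         [Flag1, Flag4]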

With the relationships column converted to lists, we can create a new dataframe with as many columns as the longest relationship list.

rel = pd.DataFrame(df['relationships'].values.tolist(), index=df.index)

After that we need to stack the columns while preserving the index, so we will use a pandas MultiIndex: the id and the relation column number (0, 1, 2, 3).

relations = rel.stack()
relations.index.names = ['id','relation_number']
relations

We get:
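(Expected output with the sample data above; stack drops the missing fourth relation for id 12346.)

id     relation_number
12346  0                  135:2345678
       1                  212:4354670
       2                  198:9876545
12345  0                  136:2343678
       1                  212:4354670
       2                  198:9876541
       3                  199:9876535
dtype: object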

At this point we have all of our relations in rows, but we still can't group by relation type. So we will split each relation into two columns, relation_type and department, on the ':' separator.

clear_relations = relations.str.split(':')
clear_relations = pd.DataFrame(clear_relations.values.tolist(), index=clear_relations.index,columns=['relation_type','department'])
clear_relations

The result is:
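(Expected output with the sample data:)

                       relation_type department
id    relation_number
12346 0                          135    2345678
      1                          212    4354670
      2                          198    9876545
12345 0                          136    2343678
      1                          212    4354670
      2                          198    9876541
      3                          199    9876535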

Our relations are ready to analyze, but the flags structure is still unusable. So we will convert the flag lists to columns and then stack them, just as we did for relations.

flags = pd.DataFrame(df['flags'].values.tolist(), index=df.index)
flags = flags.stack()
flags.index.names = ['id','flag_number']
flags

The result is:
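(Expected output with the sample data:)

id     flag_number
12346  0              Flag1
       1              Flag2
       2              Flag3
12345  0              Flag1
       1              Flag4
dtype: object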


Voilà! It's all ready to analyze!

So, for example, to see how many relations of each type we have, and which type is the most common:

clear_relations.groupby('relation_type').agg('count')['department'].sort_values(ascending=False)

We get:
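(Expected output with the sample data; the order among the types tied at 1 may vary.)

relation_type
212    2
135    1
136    1
198    1
199    1
Name: department, dtype: int64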


All the code: GitHub project

If you're willing to consider other packages, take a look at pandas, which is built on top of numpy. You can read SQL statements directly into a dataframe, then filter.

For example,

import pandas
sql = '''SELECT * FROM <table> WHERE <condition>'''
df = pandas.read_sql(sql, <connection>)

# Your output might look like the following:

        0                                         1                     2
0   12346   (135:2345678, 212:4354670, 198:9876545) (Flag1, Flag2, Flag3)
1   12345   (136:2343678, 212:4354670, 198:9876545) (Flag1, Flag2, Flag4)

# Format your records into rows
# This part will take some work, and really depends on how your data is formatted
# Do you have repeated values? Are the records always the same size?

# Select only the rows where relationship = 125
rel_125 = df[df['Relationship'] == 125]
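A rough sketch of that formatting step (assuming the second and third columns come back as the parenthesized strings shown above; the column names here are made up):

# hypothetical names for the three columns shown above
df.columns = ['EmployeeID', 'Relationship', 'Flags']

# strip the parentheses and split on ', ' so each cell becomes a list of strings
df['Relationship'] = df['Relationship'].str.strip('()').str.split(', ')
df['Flags'] = df['Flags'].str.strip('()').str.split(', ')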

The pandas formatting is more in-depth than fits in a Q&A, but a good resource is 10 Minutes to Pandas.

You can also filter the rows directly, though it may not be the most efficient approach. For example, the following selects only the rows where a relationship starts with '212' (this assumes each entry in the Relationship column is a list of strings, as produced above).

df[df['Relationship'].apply(lambda x: any(y.startswith('212') for y in x))]
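On a newer pandas (0.25 or later, which also means Python 3 rather than the Python 2.7 setup mentioned in the question), a convenient way to get the counts themselves is explode plus value_counts. A sketch, assuming the Relationship and Flags columns already hold lists of strings as above:

# one relation per row, then count occurrences of each relation type
rel_type_counts = df['Relationship'].explode().str.split(':').str[0].value_counts()

# one flag per row, then count occurrences of each flag
flag_counts = df['Flags'].explode().value_counts()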
