Python Data Analysis from SQL Query

I'm about to start some Python data analysis unlike anything I've done before. I'm currently studying numpy, but so far it hasn't given me any insight into how to do this.

I'm using Python 2.7.14 (Anaconda) with cx_Oracle to query complex records.

Each record will be a unique individual with three columns: Employee ID, Relationship Tuples (a Relationship Type Code paired with a Department number; may contain multiple), and Account Flags (flag strings; may contain multiple).

So one record might be:

 [(123456), (135:2345678, 212:4354670, 198:9876545), (Flag1, Flag2, Flag3)]

I need to develop a Python script that will take these records and create various counts.

The example record would be counted in at least 9 different counts:
How many with relationship: 135
How many with relationship: 212
How many with relationship: 198
How many in Department: 2345678
How many in Department: 4354670
How many in Department: 9876545
How many with Flag: Flag1
How many with Flag: Flag2
How many with Flag: Flag3

The other tricky part is that I can't pre-define the relationship codes, departments, or flags. What I'm counting has to be determined by the data retrieved from the query.

Once I understand how to do that, hopefully the next step, getting how many with relationship X have flag Y, etc., will be intuitive.

I know this is a lot to ask, but if someone could just point me in the right direction so I can research or try some tutorials, that would be very helpful. Thank you!

First, you need to structure this data to analyze it properly. You can do that in your database engine or in Python (I'll do it in Python, using pandas as SNygard suggested).

To start, I create some fake data (based on the example you provided):

import pandas as pd 
import numpy as np
from ast import literal_eval

data = [[12346, '(135:2345678, 212:4354670, 198:9876545)', '(Flag1, Flag2, Flag3)'],
[12345, '(136:2343678, 212:4354670, 198:9876541, 199:9876535)', '(Flag1, Flag4)']]

df = pd.DataFrame(data,columns=['id','relationships','flags'])
df = df.set_index('id')
df

This returns a dataframe like this: raw_pandas_dataframe

In order to summarize or count by column, we need to reshape the data so that we can apply group-by operations on department, relationship, or flag.

We will convert the relationships and flags columns from strings to Python lists of strings. The flags column will then hold a list of flags, and the relationships column a list of relations.

# Drop the surrounding parentheses, then split on commas (and any spaces after them)
df['relationships'] = df['relationships'].str.strip('()')
df['relationships'] = df['relationships'].str.split(r',\s*')

df['flags'] = df['flags'].str.strip('()')
df['flags'] = df['flags'].str.split(r',\s*')
df

The result is: dataframe_1

With the relationships column converted to lists, we can create a new dataframe with as many columns as the longest list has relations.

rel = pd.DataFrame(df['relationships'].values.tolist(), index=df.index)

After that we need to stack the columns while preserving the index, so we will use a pandas MultiIndex: the id and the relation column number (0, 1, 2, 3).

relations = rel.stack()
relations.index.names = ['id','relation_number']
relations

We get: dataframe_2

At this point all of our relations are in rows, but we still can't group by relation type. So we will split the relations data into two columns, relation_type and department, using ':' as the separator.
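As a side note, on pandas 0.25 or newer (which requires Python 3, so not the Python 2.7 setup from the question), Series.explode reaches a similar long layout in one call. The sketch below assumes the relationships column has already been parsed into lists, and it rebuilds the sample data by hand so that it runs on its own. Unlike stack(), it does not add a relation_number index level.

```python
import pandas as pd

# Same sample records as above, already parsed into Python lists.
df = pd.DataFrame(
    {
        "relationships": [
            ["135:2345678", "212:4354670", "198:9876545"],
            ["136:2343678", "212:4354670", "198:9876541", "199:9876535"],
        ],
    },
    index=pd.Index([12346, 12345], name="id"),
)

# explode() emits one row per list element and repeats the id for each of them.
relations = df["relationships"].explode()
print(relations)
```

Each id now appears once per relation, so the later split on ':' works the same way on this Series.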

clear_relations = relations.str.split(':')
clear_relations = pd.DataFrame(clear_relations.values.tolist(), index=clear_relations.index,columns=['relation_type','department'])
clear_relations

The result is: dataframe_3_clear_relations

Our relations are ready to analyze, but the flags structure still isn't usable. So we will convert the flag lists to columns and then stack them.

flags = pd.DataFrame(df['flags'].values.tolist(), index=df.index)
flags = flags.stack()
flags.index.names = ['id','flag_number']

The result is: dataframe_4_clear_flags

Voilà! It's all ready to analyze.

So, for example, how many relations of each type do we have, and which one is the most common:

clear_relations.groupby('relation_type').agg('count')['department'].sort_values(ascending=False)

We get: group_by_relation_type

All code: GitHub project
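The same pattern extends to the other counts asked about in the question. The following is a minimal sketch, not part of the original answer: it assumes the long tables carry id as an ordinary column (calling reset_index() on clear_relations and flags would give that shape), and the column name flag is a hypothetical choice. It counts distinct individuals per department and per flag, then cross-tabulates relation types against flags.

```python
import pandas as pd

# Hypothetical long-format tables shaped like clear_relations and flags above.
clear_relations = pd.DataFrame(
    {
        "id": [12346, 12346, 12346, 12345, 12345, 12345, 12345],
        "relation_type": ["135", "212", "198", "136", "212", "198", "199"],
        "department": ["2345678", "4354670", "9876545",
                       "2343678", "4354670", "9876541", "9876535"],
    }
)
flags = pd.DataFrame(
    {
        "id": [12346, 12346, 12346, 12345, 12345],
        "flag": ["Flag1", "Flag2", "Flag3", "Flag1", "Flag4"],
    }
)

# Count distinct individuals per value, so a person listed twice under the
# same department or flag is only counted once.
per_department = clear_relations.groupby("department")["id"].nunique()
per_flag = flags.groupby("flag")["id"].nunique()

# Pair every relation of a person with every flag of the same person.
pairs = clear_relations.merge(flags, on="id")
# Keep one row per (person, relation_type, flag) before tabulating.
pairs = pairs[["id", "relation_type", "flag"]].drop_duplicates()
relation_by_flag = pd.crosstab(pairs["relation_type"], pairs["flag"])

print(per_department)
print(per_flag)
print(relation_by_flag)
```

Nothing here is pre-defined: the departments, flags, and relation types in the output are whatever values appear in the data.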

If you're willing to consider other packages, take a look at pandas, which is built on top of numpy. You can read the result of a SQL statement directly into a dataframe, then filter.

For example:

import pandas
sql = '''SELECT * FROM <table> WHERE <condition>'''
df = pandas.read_sql(sql, <connection>)

# Your output might look like the following:
#
#         0                                         1                     2
# 0   12346   (135:2345678, 212:4354670, 198:9876545) (Flag1, Flag2, Flag3)
# 1   12345   (136:2343678, 212:4354670, 198:9876545) (Flag1, Flag2, Flag4)

# Format your records into rows
# This part will take some work, and really depends on how your data is formatted
# Do you have repeated values? Are the records always the same size?

# Select only the rows where relationship == 125
rel_125 = df[df['Relationship'] == 125]

Reshaping data in pandas goes deeper than fits in a Q&A answer, but a good resource is: 10 Minutes to pandas.

You can also filter the rows directly, though it may not be the most efficient approach. For example, the following selects only the rows where a relationship starts with '212'.
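To make the read_sql step concrete, here is a small self-contained sketch. It uses an in-memory SQLite database as a stand-in for the Oracle connection, and the table name employees and its column names are hypothetical. It also assumes the relationship and flag columns arrive as parenthesized, comma-separated strings, which may not match what your query actually returns.

```python
import sqlite3

import pandas as pd

# Stand-in database: an in-memory SQLite table with hypothetical names.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, relationships TEXT, flags TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        (12346, "(135:2345678, 212:4354670, 198:9876545)", "(Flag1, Flag2, Flag3)"),
        (12345, "(136:2343678, 212:4354670, 198:9876545)", "(Flag1, Flag2, Flag4)"),
    ],
)

# read_sql runs the query and returns the result set as a DataFrame.
df = pd.read_sql("SELECT * FROM employees", conn, index_col="employee_id")

# Turn "(a, b, c)" strings into Python lists of strings.
for col in ["relationships", "flags"]:
    df[col] = df[col].str.strip("()").str.split(r",\s*")

print(df)
conn.close()
```

With cx_Oracle, the connection object would take the place of conn; the parsing loop only applies if the columns come back as strings in this format.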

df[df['Relationship'].apply(lambda x: any(y.startswith('212') for y in x))]
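For reference, a runnable version of that filter might look like the sketch below. It assumes the Relationship column already holds lists of "code:department" strings; the helper has_relationship is a hypothetical name. It compares the full code before the colon rather than using startswith, so that '212' would not also match a code such as '2125'.

```python
import pandas as pd

# Hypothetical frame where 'Relationship' already holds lists of strings.
df = pd.DataFrame(
    {
        "Relationship": [
            ["135:2345678", "212:4354670"],
            ["136:2343678", "198:9876545"],
        ]
    },
    index=[12346, 12345],
)


def has_relationship(items, code):
    # True if any "code:department" entry has exactly this relationship code.
    return any(item.split(":")[0] == code for item in items)


# Build a boolean mask row by row, then use it to select matching rows.
mask = df["Relationship"].apply(lambda items: has_relationship(items, "212"))
with_212 = df[mask]
print(with_212)
```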
