Correlation Case vs Control in gene expression

I have gene expression data from 77 cancer patients. For each patient I have one set from the blood, one set from the tumor and one set from healthy tissue:

data1 <- ExpressionBlood
data2 <- ExpressionCancerTissue
data3 <- ExpressionHealtyTissue

I would like to perform an analysis to see whether the expression in the tumor tissue correlates with the expression in the blood for all my genes. What is the best way to do this?

If you are familiar with Python, I'd use pandas. It uses "DataFrames" similarly to R, so you could take the concept and apply it in R.

Assuming your data1 is a delimited file formatted like this:

GeneName | ExpValue |
gene1       300.0
gene2       250.0

Then you can do this to get each data type into a DataFrame:

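import pandas as pd

dfblood = pd.read_csv('path/to/data1', delimiter='\t')   # blood
dftissue = pd.read_csv('path/to/data3', delimiter='\t')  # healthy tissue (data3 in the question)
dftumor = pd.read_csv('path/to/data2', delimiter='\t')   # tumor (data2 in the question)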

Now merge the DataFrames into one master df.

dftmp = pd.merge(dfblood,dftissue,on='GeneName',how='inner')
df = pd.merge(dftmp,dftumor,on='GeneName',how='inner')

Rename your columns, taking care to keep them in the order in which the frames were merged.

df.columns = ['GeneName','blood','tissue','tumor']
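If relying on column order feels fragile, one alternative is to name each value column before merging. The sketch below assumes the two-column layout shown earlier; the toy strings and the `read_expression` helper are made up for illustration and stand in for the real files.

```python
import io
import pandas as pd

# Toy tab-delimited inputs standing in for the three files (made-up values).
blood_txt = "GeneName\tExpValue\ngene1\t300.0\ngene2\t250.0\ngene3\t10.0\n"
tissue_txt = "GeneName\tExpValue\ngene1\t120.0\ngene2\t80.0\n"
tumor_txt = "GeneName\tExpValue\ngene2\t400.0\ngene1\t310.0\n"

def read_expression(text, label):
    """Hypothetical helper: read one table and name its value column after the sample type."""
    frame = pd.read_csv(io.StringIO(text), delimiter="\t")
    return frame.rename(columns={"ExpValue": label})

frames = [
    read_expression(blood_txt, "blood"),
    read_expression(tissue_txt, "tissue"),
    read_expression(tumor_txt, "tumor"),
]

# Inner merges keep only the genes present in all three tables.
df = frames[0]
for other in frames[1:]:
    df = pd.merge(df, other, on="GeneName", how="inner")

print(df)
```

Note that gene3 only appears in the blood table here, so the inner merge drops it. If losing genes silently is a concern, how='outer' would keep them with missing values instead.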

Now you can normalize your data (if it isn't already normalized) with a couple of simple commands.

df = df.set_index('GeneName') # allows you to perform computations on the entire dataset
df_norm = (df - df.mean()) / (df.max() - df.min())
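As a quick sanity check on what this normalization does, here is a small sketch on made-up numbers (not real expression data). It also shows that Pearson correlation is unaffected by this kind of per-column linear rescaling, so the step matters more for later calculations than for corr() itself.

```python
import numpy as np
import pandas as pd

# Made-up expression values for five genes; not real data.
df = pd.DataFrame(
    {
        "blood": [300.0, 250.0, 10.0, 55.0, 120.0],
        "tissue": [120.0, 80.0, 15.0, 60.0, 90.0],
        "tumor": [310.0, 400.0, 5.0, 70.0, 150.0],
    },
    index=pd.Index(["gene1", "gene2", "gene3", "gene4", "gene5"], name="GeneName"),
)

# Mean normalization, column by column (same formula as above).
df_norm = (df - df.mean()) / (df.max() - df.min())

# Each column now has a mean of about 0 and a range of 1.
print(df_norm.mean().round(12))
print(df_norm.max() - df_norm.min())

# Pearson correlation is unchanged by this linear rescaling.
same = np.allclose(df.corr().to_numpy(), df_norm.corr().to_numpy())
print(same)
```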

You can call df_norm.corr() to produce the results below. At this point you can also use numpy to perform more complex calculations, if needed.

          blood      tissue       tumor
blood   1.000000    0.395160    0.581629
tissue  0.395160    1.000000    0.840973
tumor   0.581629    0.840973    1.000000
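The matrix above correlates sample types across genes. If what you need is one coefficient per gene across the 77 patients, a possible sketch is shown below. It assumes a different layout from the one above: each data set is a genes x patients table with identical row and column labels. The data here is synthetic and the names `blood`, `tumor` and `per_gene_r` are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
genes = ["gene1", "gene2", "gene3"]
patients = [f"patient{i}" for i in range(1, 78)]  # 77 patients, as in the question

# Synthetic stand-ins: rows are genes, columns are patients.
blood = pd.DataFrame(rng.normal(size=(3, 77)), index=genes, columns=patients)
noise = pd.DataFrame(rng.normal(size=(3, 77)), index=genes, columns=patients)
tumor = blood + 0.5 * noise
tumor.loc["gene3"] = noise.loc["gene3"]  # gene3: tumor unrelated to blood by construction

# Row-wise Pearson correlation: one coefficient per gene, computed across patients.
per_gene_r = blood.corrwith(tumor, axis=1)
print(per_gene_r)
```

corrwith also accepts method='spearman', which may be a safer choice if the expression values are far from normally distributed.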

Hope this at least moves you in the right direction.

EDIT

If you want to use log fold change (for example with a Student's t-test), you can calculate the log of the original data using numpy.log:

import numpy as np

df[['blood','tissue','tumor']] = df[['blood','tissue','tumor']]+1
# +1 to avoid taking the log of 0
df_log = np.log(df[['blood','tissue','tumor']])

To get the log fold change for each gene, subtract the logged columns. This appends new columns to your df_log DataFrame.

df_log['logFCBloodTumor'] = df_log['blood'] - df_log['tumor']
df_log['logFCBloodTissue'] = df_log['blood'] - df_log['tissue']
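The code above uses the natural log. Fold changes in expression work are often reported in base 2, so that a value of 1 means a doubling. A minimal variant with np.log2 is sketched below on made-up values; the `pseudocount` name and the `log2FC...` column names are illustrative, and it leaves the original df untouched.

```python
import numpy as np
import pandas as pd

# Made-up expression values; not real data.
df = pd.DataFrame(
    {"blood": [300.0, 250.0, 0.0], "tissue": [120.0, 80.0, 15.0], "tumor": [75.0, 1000.0, 3.0]},
    index=pd.Index(["gene1", "gene2", "gene3"], name="GeneName"),
)

pseudocount = 1.0  # assumption: same +1 offset as above, to avoid log(0)
df_log2 = np.log2(df + pseudocount)

# A difference of logs is the log of a ratio:
# log2(blood + 1) - log2(tumor + 1) == log2((blood + 1) / (tumor + 1))
df_log2["log2FCBloodTumor"] = df_log2["blood"] - df_log2["tumor"]
df_log2["log2FCBloodTissue"] = df_log2["blood"] - df_log2["tissue"]
print(df_log2.round(3))
```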
