简体   繁体   English

通过 python (Py2neo) 将大型数据集转录为 Neo4j

[英]Transcribing large datasets into Neo4j via python (Py2neo)

I've spent the last few weeks trying to load a genomic dataset into Neo4j, using the Scikit Allel library.过去几周我一直在尝试使用 Scikit Allel 库将基因组数据集加载到 Neo4j 中。 I've managed to load all variants for the exomes in a VCF file and all subjects with related phenotype data, I'm now just attempting to create the relationships between the variants and the subjects.我已经设法在 VCF 文件中加载了外显子组的所有变体以及具有相关表型数据的所有受试者,我现在只是尝试创建变体和受试者之间的关系。 I'm not very experienced in python, and I don't think this question needs a good understanding of either genomics or the Scikit-Allel library, so don't be scared off by that.我在 python 方面不是很有经验,我认为这个问题不需要对基因组学或 Scikit-Allel 库有很好的理解,所以不要被那个吓到。

The code below works, but incredibly slowly.下面的代码有效,但速度非常慢。 I think creating a dataframe, or sets of lists for each subject might be a better way to do this, rather than creating a graph transaction for each element in the j for loop, but any advice would be appreciated on how best to do this.我认为为每个主题创建一个数据框或一组列表可能是一个更好的方法,而不是为j for 循环中的每个元素创建一个图形事务,但任何关于如何最好地做到这一点的建议都将不胜感激。 I have looked at including Numba for this as well, but wouldn't know where to start.我也考虑过将 Numba 包括在内,但不知道从哪里开始。

This process is creating ~1 new subject per hour, per chromosome out of ~1750 subjects and 23 chromosomes, so it will take a very long time for the current set up to work.这个过程每小时会在大约 1750 名受试者和 23 条染色体中的每条染色体上创建大约 1 个新受试者,因此当前设置需要很长时间才能起作用。

import pandas as pd
import csv
import math
import allel
import zarr
from py2neo import Graph, Node, Relationship, NodeMatcher

zarr_path = '.../chroms.zarr'

callset = zarr.open_group(zarr_path, mode='r')

samples = callset[chrom]['samples']

graph = Graph(user="neo4j", password="password")

chrom_list = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,'X']

for chrom in chrom_list:


    variants = allel.VariantChunkedTable(callset[chrom]['variants'], names=['AC','AF_AFR', 'AF_AMR', 'AF_ASN', 'AF_EUR', 'AF_MAX', 'CGT', 'CLR', 'CSQ', 'DP', 'DP4', 'ESP_MAF', 'FILTER_LowQual', 'FILTER_MinHWE', 'FILTER_MinVQSLOD', 'FILTER_PASS', 'HWE', 'ICF', 'ID', 'IS', 'PC2', 'PCHI2', 'POS', 'PR', 'QCHI2', 'QUAL', 'REF', 'ALT', 'INDEL', 'SHAPEIT', 'SNP_ID', 'TYPE', 'UGT', 'VQSLOD', 'dbSNPmismatch', 'is_snp', 'numalt'], index='POS')
    pos = variants['POS'][:]
    SNPid = variants['ID'][:]
    ref = variants['REF'][:]
    alt = variants['ALT'][:]
    dp = variants['DP'][:]
    ac = variants['AC'][:]
    vartype = variants['TYPE'][:]
    qual = variants['QUAL'][:]
    vq = variants['VQSLOD'][:]
    numalt = variants['numalt'][:]
    csq = variants['CSQ'][:]
    vcfv = 'VCFv4.1'
    refv = 'file:///lustre/scratch105/projects/g1k/ref/main_project/human_g1k_v37.fasta'
    dpz = callset[chrom]['calldata/DP']
    psz = callset[chrom]['calldata/PS']
    plz = callset[chrom]['calldata/PL']
    gpz = callset[chrom]['calldata/GP']
    calldata = callset[chrom]['calldata']
    gt = allel.GenotypeDaskArray(calldata['GT'])
    hap = gt.to_haplotypes()
    hap1 = hap[:, ::2]
    hap2 = hap[:, 1::2]
    i = 0
    for i in range(len(samples)):
        subject = samples[i]
        subject_node = matcher.match("Subject", subject_id= subject)
        if subject_node.first() is None:
            continue
        seq_tech = 'Illumina HiSeq 2000 (ILLUMINA)'
        dp = dpz[:, i]
        ps = psz[:, i]
        pl = plz[:, i]
        gp = gpz[:, i]
        list_h1 = hap1[:, i].compute()
        list_h2 = hap2[:, i].compute()
        chrom_label = "Chromosome_" + str(chrom)
        j = 0
        for j in range(len(pos)):  
            h1 = int(list_h1[j])
            h2 = int(list_h2[j])
            read_depth = int(dp[j])
            ps1 = int(ps[j])
            PL0 = int(pl[j][0])
            PL1 = int(pl[j][1])
            PL2 = int(pl[j][2])
            genotype = str(h1) + '|' + str(h2)
            GP0 = float(gp[j][0])
            GP1 = float(gp[j][1])
            GP2 = float(gp[j][2])
            k = int(pos[j])
            l = str(ref[j])
            m = str(alt[j][h1-1])
            o = str(alt[j][h2-1])

            if h1 == 0 and h2 == 0:
                a1 = matcher.match(chrom_label, "Allele", pos= k, bp = l)
                r2 = Relationship(subject_node.first(), "Homozygous", a1.first(), HTA=h1, HTB=h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r2)

            elif h1 == 0 and h2 > 0:
                a1 = matcher.match(chrom_label, "Allele", pos= k, bp = l)
                r2 = Relationship(subject_node.first(), "Heterozygous", a1.first(), HTA=h1, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r2)
                a2 = matcher.match(chrom_label, "Allele", pos= k, bp = o)
                r3 = Relationship(subject_node.first(), "Heterozygous", a2.first(), HTB=h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r3)

            elif h1 > 0 and h2 == 0:
                a1 = matcher.match(chrom_label, "Allele", pos= k, bp = m)
                r2 = Relationship(subject_node.first(), "Heterozygous", a1.first(), HTA=h1, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r2)
                a2 = matcher.match(chrom_label, "Allele", pos= k, bp = l)
                r3 = Relationship(subject_node.first(), "Heterozygous", a2.first(), HTB=h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r3)

            elif h1 == h2 and h1 > 0:
                a1 = matcher.match(chrom_label, "Allele", pos= k, bp = m)
                r2 = Relationship(subject_node.first(), "Homozygous", a1.first(), HTA = h1, HTB = h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r2)

            else:
                a1 = matcher.match(chrom_label, "Allele", pos= k, bp = m)
                r2 = Relationship(subject_node.first(), "Heterozygous", a1.first(), HTA=h1, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r2)

                a2 = matcher.match(chrom_label, "Allele", pos= k, bp = o)
                r3 = Relationship(subject_node.first(), "Heterozygous", a2.first(), HTB=h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r3)
        print("Subject " + subject + " completed.")
print(chrom_label + "completed.")

Any help is very much appreciated!很感谢任何形式的帮助!

Creating a lot of elements individually will always be slow, largely because of the number of network hops required.单独创建大量元素总是很慢,主要是因为所需的网络跃点数。 You're also carrying out matches in between each as well, which will further increase the time.您还在每个人之间进行比赛,这将进一步增加时间。

The best approach to this kind of problem is to look at batching, both for reads and for writes.解决此类问题的最佳方法是查看批处理,包括读取和写入。 While you can't do everything at once either, batching your operations into at least a few hundred at a time will have a significant effect.虽然您也无法一次性完成所有操作,但一次将您的操作分批进行至少数百个操作会产生显着效果。 In your case, you will probably need to carry out a bulk read, followed by a bulk write, and so on.在您的情况下,您可能需要执行批量读取,然后执行批量写入,依此类推。

So specifically, look to carry out a match against multiple entities (you might be able to use an "in" modifier for this, or you might need to drop into raw Cypher).因此,特别要针对多个实体进行匹配(您可能可以为此使用“in”修饰符,或者您可能需要放入原始 Cypher)。 For writes, build up a subgraph locally with the relevant nodes and relationships and create that in a single call.对于写入,使用相关节点和关系在本地构建一个子图,并在一次调用中创建它。

Your optimal batch size will only be discovered through experimentation, so you likely won't get this right first time.您的最佳批量大小只能通过实验发现,因此您可能不会第一次就做对。 But batching is definitely the key here.但批处理绝对是这里的关键。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM