Pandas: for each groupby group, enumerate over a string column and convert to a counter dictionary

Question

I'm trying to automate building a networkx graph for any input pandas DataFrame.

The DataFrame looks like this:

  FeatureID       BC         chrom       pos        ftm_call
  1_1_1           GCTATT     12          25398138   NRAS_3
  1_1_1           GCCTAT     12          25398160   NRAS_3
  1_1_1           GCCTAT     12          25398073   NRAS_3
  1_1_1           GATCCT     12          25398128   NRAS_3
  1_1_1           GATCCT     12          25398107   NRAS_3

Here's the algorithm I need to sort out:

1. Group by FeatureID.
2. For each FeatureID, select the graph whose "name" attribute matches ftm_call.
3. For each row in the group, enumerate over the BC column, where the starting position equals the value in the pos column.
4. For every letter in BC, check whether that letter is already in the graph at that position. If not, add it with a weight of 1. If it is already there, add 1 to its weight.

So far, here is what I have:

import pandas as pd
import numpy as np
import networkx as nx
from collections import defaultdict

# read in test basecalls
hamming_df = pd.read_csv("./test_data.txt", sep="\t")
hamming_df = hamming_df[["FeatureID", "BC", "chrom", "pos"]]

# initiate graphs 
G = nx.DiGraph(name="G")
KRAS = nx.DiGraph(name="KRAS")
NRAS_3 = nx.DiGraph(name="NRAS_3")

# list of reference graphs
ref_graph_list = [G, KRAS, NRAS_3]

def add_basecalls(row):
    basecall = row.BC.astype(str)
    target = row.name[1]
    pos = row["pos"]
    chrom = row["chrom"]

    # initialize counter dictionary
    d = defaultdict()

    # select graph that matches ftm call
    graph = [f for f in ref_graph_list if f.graph["name"] == target]

stuff = hamming_df.groupby(["FeatureID", "ftm_call"])  
stuff.apply(add_basecalls)

But this doesn't pull out the barcodes as strings that I can enumerate over; it pulls them out as a Series, and I'm stuck.

The desired output is a graph containing the following information. The example is for the first BC, "GCTATT", with fictitious counts:

FeatureID    chrom    pos         Nucleotide    Weight
1_1_1        12       25398138       G            10
1_1_1        12       25398138       C            22
1_1_1        12       25398139       T            12
1_1_1        12       25398140       A            15
1_1_1        12       25398141       T            18
1_1_1        12       25398142       T            22

Thanks in advance!

Answer 1

You probably need an additional apply with axis=1 to parse the rows of each group:

import pandas as pd
import numpy as np
import networkx as nx
from collections import defaultdict

# initiate graphs
GRAPHS = {"G": nx.DiGraph(name="G"),
          "KRAS": nx.DiGraph(name="KRAS"),
          "NRAS_3": nx.DiGraph(name="NRAS_3"), # notice that test_data.txt has "NRAS_3" not "KRAS_3"
          }

# defaultdict needs a factory; int makes missing keys start at 0
WEIGHT_DICT = defaultdict(int)

def update_weight_for_row(row, target_graph):
    pos = row["pos"]
    chrom = row["chrom"]
    for letter in row.BC:
        print(letter)
        # now you have access to letters in BC per row
        # and can update graph weights as desired

def add_basecalls(grp):
    # select graph that matches ftm_call
    target = grp.name[1]
    target_graph = GRAPHS[target]
    grp.apply(lambda row: update_weight_for_row(row, target_graph), axis=1)

# read in test basecalls
hamming_df = pd.read_csv("./test_data.txt", sep="\t")
hamming_df2 = hamming_df[["FeatureID", "BC", "chrom", "pos"]]  # Why is this line needed?
stuff = hamming_df.groupby(["FeatureID", "ftm_call"])  
stuff.apply(lambda grp: add_basecalls(grp))
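
One way to fill in the weight update is sketched below. It rests on assumptions the thread does not confirm: each letter of BC sits at pos plus its offset within the string, and each (chrom, position, letter) tuple is stored as a node whose "weight" attribute holds the count. The sample rows from the question are inlined in place of test_data.txt, and this version of add_basecalls takes the graphs as an argument and loops over the groups directly instead of calling apply.

```python
import networkx as nx
import pandas as pd

# Sample rows from the question, inlined so the sketch runs without test_data.txt.
df = pd.DataFrame({
    "FeatureID": ["1_1_1"] * 5,
    "BC": ["GCTATT", "GCCTAT", "GCCTAT", "GATCCT", "GATCCT"],
    "chrom": [12] * 5,
    "pos": [25398138, 25398160, 25398073, 25398128, 25398107],
    "ftm_call": ["NRAS_3"] * 5,
})


def add_basecalls(frame, graphs):
    """Count every (chrom, position, letter) in the graph named by ftm_call."""
    for (feature_id, ftm_call), grp in frame.groupby(["FeatureID", "ftm_call"]):
        graph = graphs[ftm_call]
        # itertuples yields one row at a time, so row.BC is a plain string here
        for row in grp.itertuples(index=False):
            for offset, letter in enumerate(str(row.BC)):
                # Assumption: letter i of the barcode sits at pos + i
                node = (int(row.chrom), int(row.pos) + offset, letter)
                if graph.has_node(node):
                    graph.nodes[node]["weight"] += 1
                else:
                    graph.add_node(node, weight=1)
    return graphs


graphs = {name: nx.DiGraph(name=name) for name in ("G", "KRAS", "NRAS_3")}
add_basecalls(df, graphs)
```

Iterating over graphs["NRAS_3"].nodes(data="weight") then yields ((chrom, position, letter), weight) pairs, which can be laid out like the table in the question.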
