How to get all relevant fields from an XML file into a pandas dataframe in Python using xml.etree.ElementTree?

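I am trying to parse XML files from the Gene Expression Omnibus. I found out how to get some of the data fields, but I can't figure out how to get information such as <Title>.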

I tried adapting "How to convert an XML file to nice pandas dataframe?" but could only get some of the information.

How can I extract all of the available data into a pandas dataframe?

Here is an example of the XML file:

<Sample iid="GSM2978341">
    <Status database="GEO">
      <Submission-Date>2018-02-05</Submission-Date>
      <Release-Date>2019-03-25</Release-Date>
      <Last-Update-Date>2019-03-25</Last-Update-Date>
    </Status>
    <Title>PDD_P2_70</Title>
    <Accession database="GEO">GSM2978341</Accession>
    <Type>SRA</Type>
    <Channel-Count>1</Channel-Count>
    <Channel position="1">
      <Source>AZ-LolCDE</Source>
      <Organism taxid="679895">Escherichia coli BW25113</Organism>
      <Characteristics tag="strain">
BW25113
      </Characteristics>
      <Characteristics tag="type">
Gram-negative bacteria
      </Characteristics>
      <Characteristics tag="moa">
cell wall synthesis inhibitor / lipoprotein
      </Characteristics>
      <Characteristics tag="phenotype">
EC90 of phenotype
      </Characteristics>
      <Characteristics tag="treatment time">
~ 25 min
      </Characteristics>
      <Characteristics tag="treatment concentration">
200 uM
      </Characteristics>
      <Treatment-Protocol>
bacteria were treated with different antibiotics for ~ 25 min till  ~OD 0.2  in 2 ml tubes
      </Treatment-Protocol>
      <Growth-Protocol>
bacteria were grown in iso-sensitest medium
      </Growth-Protocol>
      <Molecule>total RNA</Molecule>
      <Extract-Protocol>
after treament bacteria were resuspended in QiaGen RNAprotect Bacteria Reagent (QiaGen #76506), incubated for 5min, centrifuged, and flash frozen on dry ice. Total RNA was extracted by incubating bacteria in Enzymatic Lysis Buffer  (lysozyme &amp; proteinase K) for 5 min followed by addition of QiaGen RLT Lysis Buffer and RNA purification  using the QiaGen RNeasy Mini kit combined with DNase treatment on a solid support (QiaGen #74104). RNA quality assessment and quantification was performed using microfluidic chip analysis on an Agilent 2100 bioanalyzer (Agilent Technologies).
For RNA-sequencing library preparation, 1000 ng total RNA was used as input. First, bacterial ribosomal RNA was depleted using the Ribo-Zero Magnetic Kit Bacteria (Illumina #MRZB12424). After depletion, RNA was resuspended in TruSeq Total RNA Sample Prep Kit Fragmentation buffer (8.5 ul RNA and 8.5 buffer) and reversed transcribed into cDNA using random hexamer primer. Then cDNA was further processed for the construction of sequencing libraries according to the manufacturer's recommendations using the TruSeq Stranded mRNA Sample Prep Kit (Illimina #RS-122-2101). Sequencing was performed with the Illumina TruSeq SBS Kit v4-HS chemistry (Illumina #FC-401-4003) on an Illumina HiSeq2500 instrument with 50 cycles of 2x50 bp paired-end sequencing.
      </Extract-Protocol>
    </Channel>
    <Data-Processing>
Illumina CASAVA v1.8.2  software used for basecalling and fastq file generation
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to Escherichia coli str. K-12 substr. MG1655, complete genome (GenBank: U00096) genome using bowtie2
Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009. In short, exons from all isoforms of a gene were merged to create one meta-transcript. The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.
Genome_build: Escherichia coli str. K-12 substr. MG1655, complete genome (GenBank: U00096)
Supplementary_files_format_and_content: tab-delimited text files in GCT format include read counts of uniquely and fraction of multiple mapped reads (counts.gct.gz), and normalized counts RPKM (rpkms.gct.gz) values for each sample
    </Data-Processing>
    <Platform-Ref ref="GPL20227" />
    <Library-Strategy>RNA-Seq</Library-Strategy>
    <Library-Source>transcriptomic</Library-Source>
    <Library-Selection>cDNA</Library-Selection>
    <Instrument-Model>
      <Predefined>Illumina HiSeq 2500</Predefined>
    </Instrument-Model>
    <Contact-Ref ref="contrib1" />
    <Supplementary-Data type="unknown">
NONE
    </Supplementary-Data>
    <Relation type="BioSample" target="https://www.ncbi.nlm.nih.gov/biosample/SAMN08466802" />
    <Relation type="SRA" target="https://www.ncbi.nlm.nih.gov/sra?term=SRX3648429" />
  </Sample>

Here is the parser I am working on, but it misses a lot of fields.

import xml.etree.ElementTree as ET
from collections import defaultdict
import pandas as pd

def read_geo_xml(path, index_name=None):
    # Parse the XML tree
    tree = ET.parse(path)
    root = tree.getroot()
    # Extract the attributes
    data = defaultdict(dict)
    for record in root:
        id_record = record.attrib["iid"]
        for x in record.findall("*"):
            for y in x:
                for k,v in y.attrib.items():
                    data[id_record][(k,v)] = y.text.strip()

    # Create pd.DataFrame
    df = pd.DataFrame(data).T
    df.index.name = index_name
    return df

import requests
from io import StringIO

url = "https://pastebin.com/raw/AJp5pshP"
text = requests.get(url).text
xml_data = StringIO(text)
df = read_geo_xml(xml_data)
df.head()
#   taxid   tag
# 679895    strain  type    moa phenotype   treatment time  treatment concentration
# GSM2978339    Escherichia coli BW25113    BW25113 Gram-negative bacteria  cell wall synthesis inhibitor / lipoprotein EC90 of phenotype   ~ 25 min    200 uM
# GSM2978340    Escherichia coli BW25113    BW25113 Gram-negative bacteria  cell wall synthesis inhibitor / lipoprotein EC90 of phenotype   ~ 25 min    200 uM
# GSM2978341    Escherichia coli BW25113    BW25113 Gram-negative bacteria  cell wall synthesis inhibitor / lipoprotein EC90 of phenotype   ~ 25 min    200 uM
# GSM2978342    Escherichia coli BW25113    BW25113 Gram-negative bacteria  new hit EC90 of phenotype   ~ 25 min    50 uM
# GSM2978343    Escherichia coli BW25113    BW25113 Gram-negative bacteria  new hit EC90 of phenotype   ~ 25 min    50 uM

Expected output:

# Everything within a <field>  </field>
Submission-Date
Release-Date
Last-Update-Date
Title
Accession
Type
Channel-Count
Source
Organism
Treatment-Protocol
Growth-Protocol
Molecule
Data-Processing
Library-Strategy
Library-Source
Library-Selection
Instrument-Model
Supplementary-Data

# Everything under <Characteristics>
strain
type
moa
phenotype
treatment time
treatment concentration

I can currently only extract from "Characteristics".

I would use parsel with XPath to extract the title and the other fields:

from parsel import Selector

data = """[your data above]"""
selector = Selector(data)

Get the data for the Characteristics nodes:

# All Characteristics nodes have a "tag" attribute,
# which is not found on the other nodes, so I'll select on that.
tags = []
contents = []
for ent in selector.xpath(".//sample//*[@tag]"):
    contents.append(ent.xpath("./text()").get().strip())
    tags.append(ent.attrib.get('tag'))
xters = dict(zip(tags,contents))

Get the data from the other nodes, excluding Characteristics:

elements = []
vals = []

#this searches through the nodes and excludes characteristics
for ent in selector.xpath(".//sample//*[not(self::characteristics)]"):
    #some nodes have no text, so we have to cater to that
    if not ent.xpath("./text()").get():
        continue
    elements.append(ent.xpath("name(.)").get())
    vals.append(ent.xpath("./text()").get().strip())

#create dictionary from the two lists
#and append the xters dict to form one main dict
results = dict(zip(elements,vals))
results.update(xters)


print(results)

{'status': '',
 'submission-date': '2018-02-05',
 'release-date': '2019-03-25',
 'last-update-date': '2019-03-25',
 'title': 'PDD_P2_70',
 'accession': 'GSM2978341',
 'type': 'Gram-negative bacteria',
 'channel-count': '1',
 'channel': '',
 'source': 'AZ-LolCDE',
 'organism': 'Escherichia coli BW25113',
 'treatment-protocol': 'bacteria were treated with different antibiotics for ~ 25 min till  ~OD 0.2  in 2 ml tubes',
 'growth-protocol': 'bacteria were grown in iso-sensitest medium',
 'molecule': 'total RNA',
 'extract-protocol': "after treament bacteria were resuspended in QiaGen RNAprotect Bacteria Reagent (QiaGen #76506), incubated for 5min, centrifuged, and flash frozen on dry ice. Total RNA was extracted by incubating bacteria in Enzymatic Lysis Buffer  (lysozyme & proteinase K) for 5 min followed by addition of QiaGen RLT Lysis Buffer and RNA purification  using the QiaGen RNeasy Mini kit combined with DNase treatment on a solid support (QiaGen #74104). RNA quality assessment and quantification was performed using microfluidic chip analysis on an Agilent 2100 bioanalyzer (Agilent Technologies).\nFor RNA-sequencing library preparation, 1000 ng total RNA was used as input. First, bacterial ribosomal RNA was depleted using the Ribo-Zero Magnetic Kit Bacteria (Illumina #MRZB12424). After depletion, RNA was resuspended in TruSeq Total RNA Sample Prep Kit Fragmentation buffer (8.5 ul RNA and 8.5 buffer) and reversed transcribed into cDNA using random hexamer primer. Then cDNA was further processed for the construction of sequencing libraries according to the manufacturer's recommendations using the TruSeq Stranded mRNA Sample Prep Kit (Illimina #RS-122-2101). Sequencing was performed with the Illumina TruSeq SBS Kit v4-HS chemistry (Illumina #FC-401-4003) on an Illumina HiSeq2500 instrument with 50 cycles of 2x50 bp paired-end sequencing.",
 'data-processing': 'Illumina CASAVA v1.8.2  software used for basecalling and fastq file generation\nSequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to Escherichia coli str. K-12 substr. MG1655, complete genome (GenBank: U00096) genome using bowtie2\nReads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009. In short, exons from all isoforms of a gene were merged to create one meta-transcript. The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.\nGenome_build: Escherichia coli str. K-12 substr. MG1655, complete genome (GenBank: U00096)\nSupplementary_files_format_and_content: tab-delimited text files in GCT format include read counts of uniquely and fraction of multiple mapped reads (counts.gct.gz), and normalized counts RPKM (rpkms.gct.gz) values for each sample',
 'library-strategy': 'RNA-Seq',
 'library-source': 'transcriptomic',
 'library-selection': 'cDNA',
 'instrument-model': '',
 'predefined': 'Illumina HiSeq 2500',
 'supplementary-data': 'NONE',
 'strain': 'BW25113',
 'moa': 'cell wall synthesis inhibitor / lipoprotein',
 'phenotype': 'EC90 of phenotype',
 'treatment time': '~ 25 min',
 'treatment concentration': '200 uM'}
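
One thing to watch in the output above: 'type' holds 'Gram-negative bacteria' rather than 'SRA', because the Characteristics entry tagged "type" overwrites the <Type> element when results.update(xters) merges the two dictionaries. Below is a minimal sketch of one way to keep both values. The "characteristics:" prefix and the trimmed-down dictionaries are assumptions made for illustration, not part of the original answer.

```python
# Trimmed-down stand-ins for the two dictionaries built above.
results = {"title": "PDD_P2_70", "type": "SRA"}
xters = {"strain": "BW25113", "type": "Gram-negative bacteria"}

# Plain update(): the Characteristics "type" silently replaces <Type>.
merged_plain = dict(results)
merged_plain.update(xters)

# Prefix the Characteristics keys (hypothetical naming) so nothing collides.
merged_safe = dict(results)
merged_safe.update({f"characteristics:{k}": v for k, v in xters.items()})

print(merged_plain["type"])
print(merged_safe["type"], "|", merged_safe["characteristics:type"])
```

Any prefix that cannot clash with an element name would do; the point is only that the two groups of keys are kept apart before merging.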

You can read your data into a dataframe:

pd.DataFrame.from_dict(results,orient='index')
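
Since the question asks for xml.etree.ElementTree and for several samples at once, here is a minimal sketch of the same idea using only the standard library and pandas. It assumes that the file has no XML namespaces and that Characteristics are the only elements that need their "tag" attribute as part of the column name (as in the excerpt above). The helper name sample_to_row, the wrapper element Root, the "Characteristics:" column prefix, and the shortened two-sample document (including the second sample's title) are made up for illustration.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Shortened, made-up document with two samples in the style of the question.
XML = """<Root>
  <Sample iid="GSM2978341">
    <Status database="GEO">
      <Submission-Date>2018-02-05</Submission-Date>
    </Status>
    <Title>PDD_P2_70</Title>
    <Type>SRA</Type>
    <Channel position="1">
      <Characteristics tag="strain">
BW25113
      </Characteristics>
      <Characteristics tag="type">
Gram-negative bacteria
      </Characteristics>
    </Channel>
    <Platform-Ref ref="GPL20227" />
  </Sample>
  <Sample iid="GSM2978342">
    <Title>PDD_P2_71</Title>
    <Type>SRA</Type>
    <Channel position="1">
      <Characteristics tag="strain">
BW25113
      </Characteristics>
    </Channel>
  </Sample>
</Root>"""


def sample_to_row(sample):
    """Flatten one <Sample> element into a {column: text} dict."""
    row = {}
    # iter() visits the element itself and every descendant, at any depth.
    for el in sample.iter():
        text = (el.text or "").strip()
        if not text:
            # Containers such as <Status> and empty tags carry no text.
            continue
        if el.tag == "Characteristics":
            # Use the "tag" attribute so each characteristic gets its own column.
            row[f"Characteristics:{el.get('tag')}"] = text
        else:
            row[el.tag] = text
    return row


root = ET.fromstring(XML)
rows = {s.get("iid"): sample_to_row(s) for s in root.iter("Sample")}
df = pd.DataFrame.from_dict(rows, orient="index")
print(df)
```

Fields that a sample lacks come out as NaN. If an element name repeats within one sample, the last occurrence wins in this sketch, so repeated elements would need their own handling.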

An example using simplified_scrapy:

from simplified_scrapy import SimplifiedDoc, utils

def foo(ele, row):
  children = ele.children
  for a in ele:
      if a != 'html' and a != 'tag': row.append(ele[a])
  if children:
    for child in children:
      foo(child,row)
  elif ele['html']:
    row.append(ele['html'])

html = '''
<Sample iid="GSM2978341">
    <Status database="GEO">
      <Submission-Date>2018-02-05</Submission-Date>
      <Release-Date>2019-03-25</Release-Date>
      <Last-Update-Date>2019-03-25</Last-Update-Date>
    </Status>
    <Title>PDD_P2_70</Title>
    <Accession database="GEO">GSM2978341</Accession>
    <Type>SRA</Type>
</Sample>
'''
doc = SimplifiedDoc(html)
row = []
foo(doc,row)
print(row)

Result:

['GSM2978341', 'GEO', '2018-02-05', '2019-03-25', '2019-03-25', 'PDD_P2_70', 'GEO', 'GSM2978341', 'SRA']
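
The flat list above drops the field names. If the names matter, a similar recursive walk can be sketched with xml.etree.ElementTree from the standard library, storing each value next to a label. The "Tag@attribute" label format and the helper name walk are assumptions made for this illustration; they are not something simplified_scrapy provides.

```python
import xml.etree.ElementTree as ET

xml = """
<Sample iid="GSM2978341">
    <Status database="GEO">
      <Submission-Date>2018-02-05</Submission-Date>
      <Release-Date>2019-03-25</Release-Date>
      <Last-Update-Date>2019-03-25</Last-Update-Date>
    </Status>
    <Title>PDD_P2_70</Title>
    <Accession database="GEO">GSM2978341</Accession>
    <Type>SRA</Type>
</Sample>
"""


def walk(element, row):
    """Recursively collect attribute values and text as (label, value) pairs."""
    # Attributes first, labelled "Tag@attribute" (hypothetical format).
    for name, value in element.attrib.items():
        row.append((f"{element.tag}@{name}", value))
    # Then the element's own text, if it has any besides whitespace.
    text = (element.text or "").strip()
    if text:
        row.append((element.tag, text))
    # Finally descend into the children, in document order.
    for child in element:
        walk(child, row)


row = []
walk(ET.fromstring(xml), row)
print(row)
```

The values come out in the same order as in the list above, each paired with the element or attribute it came from.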

