簡體   English   中英

使用 XLM 數據集的 NLP

[英]NLP using XLM dataset

我正在嘗試對包含以下行的數據集進行 NLP

00001 B 74457
00002 C 12804123 16026213 14627885
00004 A 15329425 9058342 11279767

其中行中的第一個元素是標識符第二個是標簽推薦,它只能有三個標簽$A,B,C$,示例中的數字12804123代表XML的id,它包含數據,例如,文本,位置等。基於此,我需要從 XML 文件中提取數據並使用它來制作模型。 所以首先我想從 XML 文件中提取一些數據,並制作一個結構數據的數據框。 下面是 XML 文件的示例。 當我運行命令 pd.read_xml(xml) 它給出

    medlinecitation     pubmeddata
0   NaN     NaN

來自 Kaggle 或任何其他來源等的任何示例我都可以進行分析。

74457.xml = '''
<pubmedarticleset>
<pubmedarticle>
<medlinecitation owner="NLM" status="MEDLINE">
<pmid version="1"> 74457 </pmid>
<datecreated>
<year> 1978 </year>
<month> 03 </month>
<day> 21 </day>
</datecreated>
<datecompleted>
<year> 1978 </year>
<month> 03 </month>
<day> 21 </day>
</datecompleted>
<daterevised>
<year> 2007 </year>
<month> 11 </month>
<day> 15 </day>
</daterevised>
<article pubmodel="Print">
<journal>
<issn issntype="Print"> 0140-6736 </issn>
<journalissue citedmedium="Print">
<volume> 1 </volume>
<issue> 7984 </issue>
<pubdate>
<year> 1976 </year>
<month> Sep </month>
<day> 4 </day>
</pubdate>
</journalissue>
<title> Lancet </title>
<isoabbreviation> Lancet </isoabbreviation>
</journal>
<articletitle>
Prophylactic treatment of alcoholism by lithium carbonate. A controlled study.
</articletitle>
<pagination>
<medlinepgn> 481-2 </medlinepgn>
</pagination>
<abstract>
<abstracttext>
Lithium therapy has been shown to have a therapeutic influence in reducing the drinking and incapacity by alcohol in depressive alcoholics in a prospective double-blind placebo-controlled trial conducted over one year, but it had no significant effect on non-depressed patients. Patients in the trial treated by placebo had significantly greater alcoholic morbidity if they were depressive than if they were non-depressive.
</abstracttext>
</abstract>
<authorlist completeyn="Y">
<author validyn="Y">
<lastname> Merry </lastname>
<forename> J </forename>
<initials> J </initials>
</author>
<author validyn="Y">
<lastname> Reynolds </lastname>
<forename> C M </forename>
<initials> CM </initials>
</author>
<author validyn="Y">
<lastname> Bailey </lastname>
<forename> J </forename>
<initials> J </initials>
</author>
<author validyn="Y">
<lastname> Coppen </lastname>
<forename> A </forename>
<initials> A </initials>
</author>
</authorlist>
<language> eng </language>
<publicationtypelist>
<publicationtype> Clinical Trial </publicationtype>
<publicationtype> Comparative Study </publicationtype>
<publicationtype> Journal Article </publicationtype>
<publicationtype> Randomized Controlled Trial </publicationtype>
</publicationtypelist>
</article>
<medlinejournalinfo>
<country> ENGLAND </country>
<medlineta> Lancet </medlineta>
<nlmuniqueid> 2985213R </nlmuniqueid>
<issnlinking> 0140-6736 </issnlinking>
</medlinejournalinfo>
<chemicallist>
<chemical>
<registrynumber> 0 </registrynumber>
<nameofsubstance> Placebos </nameofsubstance>
</chemical>
<chemical>
<registrynumber> 7439-93-2 </registrynumber>
<nameofsubstance> Lithium </nameofsubstance>
</chemical>
</chemicallist>
<citationsubset> AIM </citationsubset>
<citationsubset> IM </citationsubset>
<meshheadinglist>
<meshheading>
<descriptorname majortopicyn="N"> Adult </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Alcohol Drinking </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Alcoholism </descriptorname>
<qualifiername majortopicyn="Y"> drug therapy </qualifiername>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Clinical Trials as Topic </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Depression </descriptorname>
<qualifiername majortopicyn="N"> chemically induced </qualifiername>
<qualifiername majortopicyn="Y"> prevention & control </qualifiername>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Double-Blind Method </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Drug Evaluation </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Female </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Humans </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Lithium </descriptorname>
<qualifiername majortopicyn="Y"> therapeutic use </qualifiername>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Male </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Middle Aged </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Placebos </descriptorname>
</meshheading>
</meshheadinglist>
</medlinecitation>
<pubmeddata>
<history>
<pubmedpubdate pubstatus="pubmed">
<year> 1976 </year>
<month> 9 </month>
<day> 4 </day>
</pubmedpubdate>
<pubmedpubdate pubstatus="medline">
<year> 1976 </year>
<month> 9 </month>
<day> 4 </day>
<hour> 0 </hour>
<minute> 1 </minute>
</pubmedpubdate>
<pubmedpubdate pubstatus="entrez">
<year> 1976 </year>
<month> 9 </month>
<day> 4 </day>
<hour> 0 </hour>
<minute> 0 </minute>
</pubmedpubdate>
</history>
<publicationstatus> ppublish </publicationstatus>
<articleidlist>
<articleid idtype="pubmed"> 74457 </articleid>
</articleidlist>
</pubmeddata>
</pubmedarticle>
</pubmedarticleset>'''

請幫助我了解發生了什么? 我怎樣才能使它成為一個數據框?

這是一種方法:

import pandas as pd

try:
    medlinecitation = pd.read_xml("74457.xml", xpath=".//medlinecitation").dropna(
        axis=1
    )
except ValueError:
    medlinecitation = pd.DataFrame()

try:
    pubmedpubdate = pd.read_xml("74457.xml", xpath=".//pubmedpubdate")
except ValueError:
    pubmedpubdate = pd.DataFrame()

df = pd.merge(
    left=medlinecitation,
    right=pubmedpubdate,
    how="outer",
    left_index=True,
    right_index=True,
).fillna(method="ffill")
print(df)
# Output
  owner   status     pmid citationsubset pubstatus  year  month  day  hour  \
0   NLM  MEDLINE  74457.0             IM    pubmed  1976      9    4   NaN   
1   NLM  MEDLINE  74457.0             IM   medline  1976      9    4   0.0   
2   NLM  MEDLINE  74457.0             IM    entrez  1976      9    4   0.0   

   minute  
0     NaN  
1     1.0  
2     0.0 

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM