
NLP using XML dataset

I am trying to do NLP on a dataset that contains lines like the following:

00001 B 74457
00002 C 12804123 16026213 14627885
00004 A 15329425 9058342 11279767

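The first element of each line is an identifier and the second is the recommended label, which can take only three values: A, B, or C. The remaining numbers (12804123 in the example) are the ids of XML files, each of which contains data such as text, location, etc. Based on this, I need to extract data from the XML files and use it to build a model. So first I want to extract some data from the XML files and build a dataframe of structured data. Below is an example of an XML file. When I run pd.read_xml(xml), it gives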

    medlinecitation     pubmeddata
0   NaN     NaN

Any examples, from Kaggle or any other source, that I could study would be welcome.

74457.xml = '''
<pubmedarticleset>
<pubmedarticle>
<medlinecitation owner="NLM" status="MEDLINE">
<pmid version="1"> 74457 </pmid>
<datecreated>
<year> 1978 </year>
<month> 03 </month>
<day> 21 </day>
</datecreated>
<datecompleted>
<year> 1978 </year>
<month> 03 </month>
<day> 21 </day>
</datecompleted>
<daterevised>
<year> 2007 </year>
<month> 11 </month>
<day> 15 </day>
</daterevised>
<article pubmodel="Print">
<journal>
<issn issntype="Print"> 0140-6736 </issn>
<journalissue citedmedium="Print">
<volume> 1 </volume>
<issue> 7984 </issue>
<pubdate>
<year> 1976 </year>
<month> Sep </month>
<day> 4 </day>
</pubdate>
</journalissue>
<title> Lancet </title>
<isoabbreviation> Lancet </isoabbreviation>
</journal>
<articletitle>
Prophylactic treatment of alcoholism by lithium carbonate. A controlled study.
</articletitle>
<pagination>
<medlinepgn> 481-2 </medlinepgn>
</pagination>
<abstract>
<abstracttext>
Lithium therapy has been shown to have a therapeutic influence in reducing the drinking and incapacity by alcohol in depressive alcoholics in a prospective double-blind placebo-controlled trial conducted over one year, but it had no significant effect on non-depressed patients. Patients in the trial treated by placebo had significantly greater alcoholic morbidity if they were depressive than if they were non-depressive.
</abstracttext>
</abstract>
<authorlist completeyn="Y">
<author validyn="Y">
<lastname> Merry </lastname>
<forename> J </forename>
<initials> J </initials>
</author>
<author validyn="Y">
<lastname> Reynolds </lastname>
<forename> C M </forename>
<initials> CM </initials>
</author>
<author validyn="Y">
<lastname> Bailey </lastname>
<forename> J </forename>
<initials> J </initials>
</author>
<author validyn="Y">
<lastname> Coppen </lastname>
<forename> A </forename>
<initials> A </initials>
</author>
</authorlist>
<language> eng </language>
<publicationtypelist>
<publicationtype> Clinical Trial </publicationtype>
<publicationtype> Comparative Study </publicationtype>
<publicationtype> Journal Article </publicationtype>
<publicationtype> Randomized Controlled Trial </publicationtype>
</publicationtypelist>
</article>
<medlinejournalinfo>
<country> ENGLAND </country>
<medlineta> Lancet </medlineta>
<nlmuniqueid> 2985213R </nlmuniqueid>
<issnlinking> 0140-6736 </issnlinking>
</medlinejournalinfo>
<chemicallist>
<chemical>
<registrynumber> 0 </registrynumber>
<nameofsubstance> Placebos </nameofsubstance>
</chemical>
<chemical>
<registrynumber> 7439-93-2 </registrynumber>
<nameofsubstance> Lithium </nameofsubstance>
</chemical>
</chemicallist>
<citationsubset> AIM </citationsubset>
<citationsubset> IM </citationsubset>
<meshheadinglist>
<meshheading>
<descriptorname majortopicyn="N"> Adult </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Alcohol Drinking </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Alcoholism </descriptorname>
<qualifiername majortopicyn="Y"> drug therapy </qualifiername>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Clinical Trials as Topic </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Depression </descriptorname>
<qualifiername majortopicyn="N"> chemically induced </qualifiername>
<qualifiername majortopicyn="Y"> prevention &amp; control </qualifiername>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Double-Blind Method </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Drug Evaluation </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Female </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Humans </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Lithium </descriptorname>
<qualifiername majortopicyn="Y"> therapeutic use </qualifiername>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Male </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Middle Aged </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Placebos </descriptorname>
</meshheading>
</meshheadinglist>
</medlinecitation>
<pubmeddata>
<history>
<pubmedpubdate pubstatus="pubmed">
<year> 1976 </year>
<month> 9 </month>
<day> 4 </day>
</pubmedpubdate>
<pubmedpubdate pubstatus="medline">
<year> 1976 </year>
<month> 9 </month>
<day> 4 </day>
<hour> 0 </hour>
<minute> 1 </minute>
</pubmedpubdate>
<pubmedpubdate pubstatus="entrez">
<year> 1976 </year>
<month> 9 </month>
<day> 4 </day>
<hour> 0 </hour>
<minute> 0 </minute>
</pubmedpubdate>
</history>
<publicationstatus> ppublish </publicationstatus>
<articleidlist>
<articleid idtype="pubmed"> 74457 </articleid>
</articleidlist>
</pubmeddata>
</pubmedarticle>
</pubmedarticleset>'''

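Please help me understand what is happening. How can I turn this into a dataframe?

pd.read_xml only reads the attributes and immediate children of the nodes selected by xpath (by default ./*, the children of the root). Here that is the single pubmedarticle element, whose two children contain only nested elements and no text of their own, hence one row of NaN. Here is one way to do it: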

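import pandas as pd

# Attributes and direct children of <medlinecitation>. Nested elements
# cannot be flattened by read_xml and come back as NaN, so drop those columns.
try:
    medlinecitation = pd.read_xml("74457.xml", xpath=".//medlinecitation").dropna(
        axis=1
    )
except ValueError:  # raised when the xpath matches no nodes
    medlinecitation = pd.DataFrame()

# One row per <pubmedpubdate> element.
try:
    pubmedpubdate = pd.read_xml("74457.xml", xpath=".//pubmedpubdate")
except ValueError:
    pubmedpubdate = pd.DataFrame()

# Align both frames on their index, then forward-fill the single citation
# row so that every date row carries the citation fields.
df = pd.merge(
    left=medlinecitation,
    right=pubmedpubdate,
    how="outer",
    left_index=True,
    right_index=True,
).ffill()
print(df)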
# Output
  owner   status     pmid citationsubset pubstatus  year  month  day  hour  \
0   NLM  MEDLINE  74457.0             IM    pubmed  1976      9    4   NaN   
1   NLM  MEDLINE  74457.0             IM   medline  1976      9    4   0.0   
2   NLM  MEDLINE  74457.0             IM    entrez  1976      9    4   0.0   

   minute  
0     NaN  
1     1.0  
2     0.0 
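
The frame above holds only the shallow fields. For NLP the interesting parts are the nested text fields such as articletitle and abstracttext. One way to reach them is to walk the tree with the standard library's xml.etree.ElementTree and build one flat record per article. The following is a minimal sketch under stated assumptions, not a definitive implementation: the helper names text_of, parse_article and parse_label_line are made up for illustration, the choice of fields is arbitrary, and the sample is an abridged copy of the XML from the question.

```python
import xml.etree.ElementTree as ET


def text_of(node, path):
    """Return the stripped text of the first element matching path, or None."""
    found = node.find(path) if node is not None else None
    if found is None or found.text is None:
        return None
    return found.text.strip()


def parse_article(xml_string):
    """Flatten one article document into a dict of plain values (hypothetical helper)."""
    root = ET.fromstring(xml_string.strip())
    citation = root.find(".//medlinecitation")
    mesh = [] if citation is None else [
        d.text.strip() for d in citation.iter("descriptorname") if d.text
    ]
    return {
        "pmid": text_of(citation, "pmid"),
        "title": text_of(citation, "article/articletitle"),
        "abstract": text_of(citation, "article/abstract/abstracttext"),
        "journal": text_of(citation, "article/journal/title"),
        "mesh_terms": mesh,
    }


def parse_label_line(line):
    """Split a line such as '00002 C 12804123 16026213' into id, label and XML ids."""
    parts = line.split()
    return {"row_id": parts[0], "label": parts[1], "pmids": parts[2:]}


# Abridged copy of the sample from the question, kept short for illustration.
sample = """
<pubmedarticleset>
<pubmedarticle>
<medlinecitation owner="NLM" status="MEDLINE">
<pmid version="1"> 74457 </pmid>
<article pubmodel="Print">
<journal>
<title> Lancet </title>
</journal>
<articletitle>
Prophylactic treatment of alcoholism by lithium carbonate. A controlled study.
</articletitle>
<abstract>
<abstracttext>
Lithium therapy has been shown to have a therapeutic influence in reducing the drinking and incapacity by alcohol in depressive alcoholics.
</abstracttext>
</abstract>
</article>
<meshheadinglist>
<meshheading>
<descriptorname majortopicyn="N"> Adult </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Alcoholism </descriptorname>
<qualifiername majortopicyn="Y"> drug therapy </qualifiername>
</meshheading>
</meshheadinglist>
</medlinecitation>
</pubmedarticle>
</pubmedarticleset>
"""

record = parse_article(sample)
labels = parse_label_line("00001 B 74457")
print(record["pmid"], record["title"])
print(labels)
```

A list of such records, one per XML file, can be passed to pd.DataFrame(records) and then joined to the parsed label lines on the pmid values. Note that pmid is kept as a string here, so the ids from the label file compare equal without any conversion.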
