NLP using XML dataset

I am trying to do NLP on a dataset consisting of rows like the following:

00001 B 74457
00002 C 12804123 16026213 14627885
00004 A 15329425 9058342 11279767

The first element in each row is an identifier and the second is a label, which can take only three values: $A$, $B$, or $C$. The remaining numbers, for example 12804123, are the IDs of XML files that contain data such as text, location, etc. I need to extract the data from these XML files and use it to build a model. So, first of all, I want to extract some of the data from each XML file and build a data frame of structured data. An example XML file is below. When I run the command pd.read_xml(xml), it gives:

    medlinecitation     pubmeddata
0   NaN     NaN

Is there any example, from Kaggle or any other source, that I can follow to do the analysis?
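The label rows described above could be loaded into a long-format table, one line per (identifier, label, XML id) triple, before touching any XML. This is only a sketch: the function name `parse_label_rows` and the column names `record_id`, `label`, and `pmid` are assumptions made for illustration, and the sample rows are the three shown in the question.

```python
import io

import pandas as pd

# Sample rows copied from the question; a real file would be opened from disk.
raw = """00001 B 74457
00002 C 12804123 16026213 14627885
00004 A 15329425 9058342 11279767
"""


def parse_label_rows(handle):
    """Return one row per (record_id, label, pmid) triple (hypothetical helper)."""
    records = []
    for line in handle:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or incomplete lines
        record_id, label, *pmids = parts
        for pmid in pmids:
            records.append(
                {"record_id": record_id, "label": label, "pmid": int(pmid)}
            )
    return pd.DataFrame(records, columns=["record_id", "label", "pmid"])


labels = parse_label_rows(io.StringIO(raw))
print(labels)
```

Keeping `record_id` as a string preserves the leading zeros, while `pmid` as an integer makes it easy to match against XML file names such as 74457.xml.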

74457.xml = '''
<pubmedarticleset>
<pubmedarticle>
<medlinecitation owner="NLM" status="MEDLINE">
<pmid version="1"> 74457 </pmid>
<datecreated>
<year> 1978 </year>
<month> 03 </month>
<day> 21 </day>
</datecreated>
<datecompleted>
<year> 1978 </year>
<month> 03 </month>
<day> 21 </day>
</datecompleted>
<daterevised>
<year> 2007 </year>
<month> 11 </month>
<day> 15 </day>
</daterevised>
<article pubmodel="Print">
<journal>
<issn issntype="Print"> 0140-6736 </issn>
<journalissue citedmedium="Print">
<volume> 1 </volume>
<issue> 7984 </issue>
<pubdate>
<year> 1976 </year>
<month> Sep </month>
<day> 4 </day>
</pubdate>
</journalissue>
<title> Lancet </title>
<isoabbreviation> Lancet </isoabbreviation>
</journal>
<articletitle>
Prophylactic treatment of alcoholism by lithium carbonate. A controlled study.
</articletitle>
<pagination>
<medlinepgn> 481-2 </medlinepgn>
</pagination>
<abstract>
<abstracttext>
Lithium therapy has been shown to have a therapeutic influence in reducing the drinking and incapacity by alcohol in depressive alcoholics in a prospective double-blind placebo-controlled trial conducted over one year, but it had no significant effect on non-depressed patients. Patients in the trial treated by placebo had significantly greater alcoholic morbidity if they were depressive than if they were non-depressive.
</abstracttext>
</abstract>
<authorlist completeyn="Y">
<author validyn="Y">
<lastname> Merry </lastname>
<forename> J </forename>
<initials> J </initials>
</author>
<author validyn="Y">
<lastname> Reynolds </lastname>
<forename> C M </forename>
<initials> CM </initials>
</author>
<author validyn="Y">
<lastname> Bailey </lastname>
<forename> J </forename>
<initials> J </initials>
</author>
<author validyn="Y">
<lastname> Coppen </lastname>
<forename> A </forename>
<initials> A </initials>
</author>
</authorlist>
<language> eng </language>
<publicationtypelist>
<publicationtype> Clinical Trial </publicationtype>
<publicationtype> Comparative Study </publicationtype>
<publicationtype> Journal Article </publicationtype>
<publicationtype> Randomized Controlled Trial </publicationtype>
</publicationtypelist>
</article>
<medlinejournalinfo>
<country> ENGLAND </country>
<medlineta> Lancet </medlineta>
<nlmuniqueid> 2985213R </nlmuniqueid>
<issnlinking> 0140-6736 </issnlinking>
</medlinejournalinfo>
<chemicallist>
<chemical>
<registrynumber> 0 </registrynumber>
<nameofsubstance> Placebos </nameofsubstance>
</chemical>
<chemical>
<registrynumber> 7439-93-2 </registrynumber>
<nameofsubstance> Lithium </nameofsubstance>
</chemical>
</chemicallist>
<citationsubset> AIM </citationsubset>
<citationsubset> IM </citationsubset>
<meshheadinglist>
<meshheading>
<descriptorname majortopicyn="N"> Adult </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Alcohol Drinking </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Alcoholism </descriptorname>
<qualifiername majortopicyn="Y"> drug therapy </qualifiername>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Clinical Trials as Topic </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Depression </descriptorname>
<qualifiername majortopicyn="N"> chemically induced </qualifiername>
<qualifiername majortopicyn="Y"> prevention &amp; control </qualifiername>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Double-Blind Method </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Drug Evaluation </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Female </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Humans </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Lithium </descriptorname>
<qualifiername majortopicyn="Y"> therapeutic use </qualifiername>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Male </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Middle Aged </descriptorname>
</meshheading>
<meshheading>
<descriptorname majortopicyn="N"> Placebos </descriptorname>
</meshheading>
</meshheadinglist>
</medlinecitation>
<pubmeddata>
<history>
<pubmedpubdate pubstatus="pubmed">
<year> 1976 </year>
<month> 9 </month>
<day> 4 </day>
</pubmedpubdate>
<pubmedpubdate pubstatus="medline">
<year> 1976 </year>
<month> 9 </month>
<day> 4 </day>
<hour> 0 </hour>
<minute> 1 </minute>
</pubmedpubdate>
<pubmedpubdate pubstatus="entrez">
<year> 1976 </year>
<month> 9 </month>
<day> 4 </day>
<hour> 0 </hour>
<minute> 0 </minute>
</pubmedpubdate>
</history>
<publicationstatus> ppublish </publicationstatus>
<articleidlist>
<articleid idtype="pubmed"> 74457 </articleid>
</articleidlist>
</pubmeddata>
</pubmedarticle>
</pubmedarticleset>'''

Please help me understand what is happening. How can I turn this into a data frame?
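On the "what is happening" part: pd.read_xml only flattens one level. It selects nodes with its xpath argument (default "./*", the children of the root) and builds columns from each selected node's attributes and immediate children. Here the only child of the root is pubmedarticle, and its two children, medlinecitation and pubmeddata, contain nested elements rather than text, so both columns come out as NaN. The sketch below reproduces this on a compact stand-in document that reuses a few tag names from 74457.xml; the choice of parser="etree" is an assumption made so that lxml is not required.

```python
import io

import pandas as pd

# Compact stand-in for 74457.xml, reusing a handful of its tag names.
xml = (
    "<pubmedarticleset><pubmedarticle>"
    '<medlinecitation owner="NLM" status="MEDLINE">'
    "<pmid>74457</pmid><citationsubset>IM</citationsubset>"
    "</medlinecitation>"
    "<pubmeddata><publicationstatus>ppublish</publicationstatus></pubmeddata>"
    "</pubmedarticle></pubmedarticleset>"
)

# Default xpath "./*" selects <pubmedarticle>; its children carry no text
# of their own, so every column is NaN.
default_df = pd.read_xml(io.StringIO(xml), parser="etree")
print(default_df)

# Pointing xpath at <medlinecitation> exposes its attributes and leaf children.
citation_df = pd.read_xml(
    io.StringIO(xml), xpath=".//medlinecitation", parser="etree"
)
print(citation_df)
```

In other words, the xpath has to point at the element whose direct children hold the values of interest.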

Here is one way to do it:

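import pandas as pd

# Attributes and leaf children of <medlinecitation>; container children
# (dates, article, lists) have no text and are dropped as all-NaN columns.
try:
    medlinecitation = pd.read_xml("74457.xml", xpath=".//medlinecitation").dropna(
        axis=1
    )
except ValueError:  # raised when the xpath matches no nodes
    medlinecitation = pd.DataFrame()

# One row per <pubmedpubdate> element.
try:
    pubmedpubdate = pd.read_xml("74457.xml", xpath=".//pubmedpubdate")
except ValueError:
    pubmedpubdate = pd.DataFrame()

# Align on the row index, then forward-fill the single citation row
# onto every pubmedpubdate row.
df = pd.merge(
    left=medlinecitation,
    right=pubmedpubdate,
    how="outer",
    left_index=True,
    right_index=True,
).ffill()
print(df)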
# Output
  owner   status     pmid citationsubset pubstatus  year  month  day  hour  \
0   NLM  MEDLINE  74457.0             IM    pubmed  1976      9    4   NaN   
1   NLM  MEDLINE  74457.0             IM   medline  1976      9    4   0.0   
2   NLM  MEDLINE  74457.0             IM    entrez  1976      9    4   0.0   

   minute  
0     NaN  
1     1.0  
2     0.0 
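For NLP the interesting fields (title, abstract, authors, MeSH terms) sit several levels deep, where pd.read_xml does not reach in a single call. One alternative, offered as a sketch rather than as part of the answer above, is to walk the tree with the standard library's xml.etree.ElementTree and build one flat record per article. The helper names `text_of` and `article_to_record`, the output column names, and the trimmed sample (with a shortened abstract) are assumptions for illustration; the tag names come from 74457.xml.

```python
import xml.etree.ElementTree as ET

import pandas as pd


def text_of(node, path):
    """Return the stripped text at `path` under `node`, or None if absent."""
    found = node.find(path)
    if found is None or found.text is None:
        return None
    return found.text.strip()


def article_to_record(article):
    """Flatten one <pubmedarticle> element into a dict (hypothetical helper)."""
    citation = article.find("medlinecitation")
    authors = [
        " ".join(filter(None, [text_of(a, "lastname"), text_of(a, "initials")]))
        for a in citation.iterfind("article/authorlist/author")
    ]
    mesh = [
        d.text.strip()
        for d in citation.iterfind("meshheadinglist/meshheading/descriptorname")
        if d.text
    ]
    return {
        "pmid": text_of(citation, "pmid"),
        "title": text_of(citation, "article/articletitle"),
        "abstract": text_of(citation, "article/abstract/abstracttext"),
        "journal": text_of(citation, "article/journal/title"),
        "authors": "; ".join(authors),
        "mesh_terms": "; ".join(mesh),
    }


# Trimmed sample with the same tag names as 74457.xml (abstract shortened).
sample = """<pubmedarticleset><pubmedarticle>
<medlinecitation owner="NLM" status="MEDLINE">
<pmid version="1"> 74457 </pmid>
<article pubmodel="Print">
<journal><title> Lancet </title></journal>
<articletitle>
Prophylactic treatment of alcoholism by lithium carbonate. A controlled study.
</articletitle>
<abstract><abstracttext>
Lithium therapy has been shown to have a therapeutic influence.
</abstracttext></abstract>
<authorlist completeyn="Y">
<author validyn="Y"><lastname> Merry </lastname><initials> J </initials></author>
<author validyn="Y"><lastname> Reynolds </lastname><initials> CM </initials></author>
</authorlist>
</article>
<meshheadinglist>
<meshheading><descriptorname majortopicyn="N"> Alcoholism </descriptorname></meshheading>
<meshheading><descriptorname majortopicyn="N"> Lithium </descriptorname></meshheading>
</meshheadinglist>
</medlinecitation>
</pubmedarticle></pubmedarticleset>"""

root = ET.fromstring(sample)
articles = pd.DataFrame([article_to_record(a) for a in root.iter("pubmedarticle")])
print(articles.T)
```

For a file on disk, `ET.parse("74457.xml").getroot()` would take the place of `ET.fromstring(sample)`. Note that ElementTree raises a ParseError on XML that is not well formed, for example a bare `&` that is not written as `&amp;`.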
