I am trying to extract data from a multi-level structured XML file. The input file is the result of this query:
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=24874852&retmode=xml&rettype=abstract&email=abc@xyz.com
Output of the query:
<?xml version="1.0" encoding="UTF-8"?>
<PubmedArticleSet>
<PubmedArticle>
<MedlineCitation Status="Publisher" Owner="NLM">
<PMID Version="1">24874852</PMID>
<DateCreated>
<Year>2014</Year>
<Month>5</Month>
<Day>30</Day>
</DateCreated>
<Article PubModel="Print-Electronic">
<Journal>
<ISSN IssnType="Electronic">1976-670X</ISSN>
<JournalIssue CitedMedium="Internet">
<PubDate>
<Year>2014</Year>
<Month>May</Month>
<Day>30</Day>
</PubDate>
</JournalIssue>
<Title>BMB reports</Title>
<ISOAbbreviation>BMB Rep</ISOAbbreviation>
</Journal>
<ArticleTitle>
Human selenium binding protein-1 (hSP56) is a negative regulator of HIF-1α and suppresses the malignant characteristics of prostate cancer cells.
</ArticleTitle>
<Pagination>
<MedlinePgn/>
</Pagination>
<ELocationID EIdType="pii">2831</ELocationID>
<Abstract>
<AbstractText NlmCategory="UNLABELLED">
In the present study, we demonstrate that ectopic expression of 56-kDa human selenium binding protein-1 (hSP56) in PC-3 cells that do not normally express hSP56 results in a marked inhibition of cell growth in vitro and in vivo. Down-regulation of hSP56 in LNCaP cells that normally express hSP56 results in enhanced anchorage-independent growth. PC-3 cells expressing hSP56 exhibit a significant reduction of hypoxia inducible protein (HIF)-1α protein levels under hypoxic conditions without altering HIF-1α mRNA (HIF1A) levels. Taken together, our findings strongly suggest that hSP56 plays a critical role in prostate cells by mechanisms including negative regulation of HIF-1α, thus identifying hSP56 as a candidate anti-oncogene product.
</AbstractText>
</Abstract>
<AuthorList>
<Author>
<LastName>Jeong</LastName>
<ForeName>Jee-Yeong</ForeName>
<Initials>JY</Initials>
<Affiliation>
Laboratory for Cell and Molecular Biology, Division of Hematology and Oncology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA; Department of Biochemistry and Cancer Research Institute, Kosin University College of Medicine, Busan, South Korea.
</Affiliation>
</Author>
<Author>
<LastName>Zhou</LastName>
<ForeName>Jin-Rong</ForeName>
<Initials>JR</Initials>
</Author>
<Author>
<LastName>Gao</LastName>
<ForeName>Chong</ForeName>
<Initials>C</Initials>
</Author>
<Author>
<LastName>Feldman</LastName>
<ForeName>Laurie</ForeName>
<Initials>L</Initials>
</Author>
<Author>
<LastName>Sytkowski</LastName>
<ForeName>Arthur J</ForeName>
<Initials>AJ</Initials>
</Author>
</AuthorList>
<Language>ENG</Language>
<PublicationTypeList>
<PublicationType>JOURNAL ARTICLE</PublicationType>
</PublicationTypeList>
<ArticleDate DateType="Electronic">
<Year>2014</Year>
<Month>5</Month>
<Day>30</Day>
</ArticleDate>
</Article>
<MedlineJournalInfo>
<MedlineTA>BMB Rep</MedlineTA>
<NlmUniqueID>101465334</NlmUniqueID>
<ISSNLinking>1976-6696</ISSNLinking>
</MedlineJournalInfo>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="entrez">
<Year>2014</Year>
<Month>5</Month>
<Day>31</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2014</Year>
<Month>5</Month>
<Day>31</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2014</Year>
<Month>5</Month>
<Day>31</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>aheadofprint</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pii">2831</ArticleId>
<ArticleId IdType="pubmed">24874852</ArticleId>
</ArticleIdList>
</PubmedData>
</PubmedArticle>
</PubmedArticleSet>
My intention is to reorganise the data in another webpage, so I am trying to extract data from every layer of this structure. I am using regex. For example, to extract the abstract text from the XML structure, this is the code I am using:
$o = urlencode("24874852");
$efetch = "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=$o&retmode=xml&rettype=abstract&email=abc@xyz.com";
#echo $efetch;
$handle1 = file_get_contents($efetch);
#echo $handle1;
preg_match_all('/<AbstractText>\s*([0-9A-Za-z\.\_\n]+)\s*<\/AbstractText>/s', $handle1, $abstext, PREG_PATTERN_ORDER);
foreach ($abstext[1] as $tiab) {
    echo $tiab;
}
I don't get the output I expect. Any idea where it might have gone wrong?
If you are going to extract text from XML, the best option is to use an XML parser, such as a DOM parser:
$document = new DOMDocument();
$document->load( "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=24874852&retmode=xml&rettype=abstract&email=abc@xyz.com" );
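Loading can fail, for instance when the service returns an error page or truncated XML. The following is a minimal sketch of defensive loading, not part of the original answer. The helper name loadXmlOrNull is hypothetical, and it parses a string with loadXML (rather than a URL with load) so that it runs offline:

```php
<?php
// Hypothetical helper (not from the original answer): parse an XML string
// and return a DOMDocument, or null when the input is not well-formed.
function loadXmlOrNull(string $xml): ?DOMDocument
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $document = new DOMDocument();
    // Guard against the empty string, which loadXML() rejects outright.
    $ok = $xml !== '' && $document->loadXML($xml);

    // Reset libxml state so this call does not affect later parsing.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $document : null;
}

$good = loadXmlOrNull('<PubmedArticleSet><PubmedArticle/></PubmedArticleSet>');
$bad  = loadXmlOrNull('<PubmedArticleSet><PubmedArticle>'); // missing closing tags

echo $good !== null ? "parsed\n" : "failed\n";
echo $bad !== null ? "parsed\n" : "failed\n";
```

The same pattern should apply to load() with a URL, assuming URL access is enabled in your PHP configuration.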
From there you can use the XPath language to select the data you want to extract. The expression //AbstractText returns a set of all <AbstractText> nodes.
You can use XPath in PHP on your parsed document:
$xpath = new DOMXPath($document);
To get all matching nodes, use:
$xpath->evaluate("//AbstractText")
To extract the text from each node, use nodeValue:
foreach ($xpath->evaluate("//AbstractText") as $abstractText) {
echo $abstractText->nodeValue."\n";
}
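Since the question asks about data from every layer of the structure, the same approach can be extended to other elements. This is a sketch under assumptions: the XML is a trimmed copy of the response from the question, inlined so the example runs offline; the abstract text is a placeholder; and the $records array layout is purely illustrative.

```php
<?php
// Trimmed copy of the efetch response, with a placeholder abstract.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation Status="Publisher" Owner="NLM">
      <PMID Version="1">24874852</PMID>
      <Article PubModel="Print-Electronic">
        <Journal><Title>BMB reports</Title></Journal>
        <Abstract>
          <AbstractText NlmCategory="UNLABELLED">
            Abstract text goes here.
          </AbstractText>
        </Abstract>
        <AuthorList>
          <Author><LastName>Jeong</LastName><ForeName>Jee-Yeong</ForeName></Author>
          <Author><LastName>Zhou</LastName><ForeName>Jin-Rong</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

$records = [];
foreach ($xpath->query('//PubmedArticle') as $article) {
    // Relative XPath expressions are evaluated against the context node.
    $authors = [];
    foreach ($xpath->query('.//AuthorList/Author', $article) as $author) {
        $authors[] = $xpath->evaluate('string(ForeName)', $author) . ' '
                   . $xpath->evaluate('string(LastName)', $author);
    }

    $records[] = [
        'pmid'     => $xpath->evaluate('string(MedlineCitation/PMID)', $article),
        'journal'  => $xpath->evaluate('string(.//Journal/Title)', $article),
        // normalize-space() trims and collapses the surrounding whitespace.
        'abstract' => $xpath->evaluate('normalize-space(.//AbstractText)', $article),
        'authors'  => $authors,
    ];
}

print_r($records);
```

Note that string() and normalize-space() use only the first matching node, so an abstract split across several <AbstractText> elements would need a loop like the one above.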
See a working example using your data here: http://codepad.viper-7.com/nlryKH
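As for why the regex returned nothing, there are two likely reasons. The pattern looks for the literal tag <AbstractText>, but the response contains <AbstractText NlmCategory="UNLABELLED">. The character class also excludes spaces, commas, hyphens and parentheses, all of which occur in the abstract. The sketch below illustrates this on a shortened sample string. It assumes the pattern from the question is written on one line, and the loosened pattern is only for comparison; a parser remains the more robust choice.

```php
<?php
// Shortened sample with the same tag shape as the real response.
$xml = "<Abstract>\n<AbstractText NlmCategory=\"UNLABELLED\">\n"
     . "In the present study, we demonstrate (shortened sample text).\n"
     . "</AbstractText>\n</Abstract>";

// Pattern from the question, written on one line.
$original = '/<AbstractText>\s*([0-9A-Za-z\.\_\n]+)\s*<\/AbstractText>/s';

// Loosened pattern: allows attributes on the tag and any characters inside.
$loosened = '/<AbstractText\b[^>]*>\s*(.*?)\s*<\/AbstractText>/s';

$originalHits = preg_match_all($original, $xml, $m1);
$loosenedHits = preg_match_all($loosened, $xml, $m2);

echo "original pattern matches: $originalHits\n";
echo "loosened pattern matches: $loosenedHits\n";
if ($loosenedHits > 0) {
    echo $m2[1][0], "\n";
}
```

Even the loosened pattern breaks on nested markup, CDATA sections or entity-encoded text, which is why the DOM and XPath approach above is preferable.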