
Extracting html tables to R

I have a set of static HTML files that contain tables. I need to extract one particular column (the 4th) from the table in each file. I am using R to extract the tables from the HTML, but some rows seem to get merged in that column (it seems that readHTMLTable does not take the <br> tags in the HTML into account). Any help will be appreciated.

This is my R Code:

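library(XML)
# Read every <table> in the file into a list of data frames
tables <- readHTMLTable("C:\\Desktop\\TEST\\140.html")
# Pick the table with the most rows
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
output <- tables[[which.max(n.rows)]]
# Print the 4th column
output[,4]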

output:

[1] 214                                                                               
[2] 321/2/1                                                                           
[3] 321/5                                                                             
[4] 353/11/1/1/1                                                                      
[5] 141/1143/1 
[6] 319/3/1                 

Ideal output should be

[1] 214                                                                               
[2] 321/2/1                                                                           
[3] 321/5                                                                             
[4] 353/11/1/1/1                                                                      
[5] 141/1
[6] 143/1 
[7] 319/3/1     

Entries [5] and [6] of the ideal output get merged into a single entry.

A sample of the HTML file is as follows:

  <table>
<tr> 
<td align="left" valign="top"><font face="Mangal"> 
  1
  </font></td>
<td height="29" align="left" valign="top"><div align="left"><font face="Mangal"> 
    86/2/5
    </font> </div></td>
<td align="left" valign="top" bordercolor="#CCCCCC"> 
  0.036<br>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    86/2/5<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    0.036<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    बिहारीलाल किसनलाल जुझारसिंह छगनलाल पिता भागमल<br>जाति रूवाला<br>पता निवासी ग्राम<br>भूमि स्वामी<br>
    </font></div></td>

 </tr>

  <tr> 
 <td align="left" valign="top"><font face="Mangal"> 
  2
  </font></td>
 <td height="29" align="left" valign="top"><div align="left"><font face="Mangal"> 
    214
    </font> </div></td>
 <td align="left" valign="top" bordercolor="#CCCCCC"> 
  0.051<br>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div></td>
 <td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    214<br>
    </font></div></td>
 <td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    0.051<br>
    </font></div></td>
 <td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    गंगाराम पिता किशना<br>जाति कुमहार<br>पता नि.चोरखेडी<br>भूमि स्वामी<br>
    </font></div></td>

  </tr>

  <tr> 
  <td align="left" valign="top"><font face="Mangal"> 
  3
  </font></td>
<td height="29" align="left" valign="top"><div align="left"><font face="Mangal"> 
    321/2/1
    </font> </div></td>
<td align="left" valign="top" bordercolor="#CCCCCC"> 
  0.063<br>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    321/2/1<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    0.063<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    गंगाराम पिता घीसा<br>जाति खाती<br>पता निवासी ग्राम<br>भूमि स्वामी<br>
    </font></div></td>

  </tr>

  <tr> 
<td align="left" valign="top"><font face="Mangal"> 
  4
  </font></td>
<td height="29" align="left" valign="top"><div align="left"><font face="Mangal"> 
    321/5
    </font> </div></td>
<td align="left" valign="top" bordercolor="#CCCCCC"> 
  0.063<br>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    321/5<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    0.063<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    रामनारायण पिता घीसालाल<br>जाति खाती<br>पता निवासी ग्राम<br>भूमि स्वामी<br>
    </font></div></td>

 </tr>

 <tr> 
 <td align="left" valign="top"><font face="Mangal"> 
  5
  </font></td>
<td height="29" align="left" valign="top"><div align="left"><font face="Mangal"> 
    353/11/1/1/1
    </font> </div></td>
<td align="left" valign="top" bordercolor="#CCCCCC"> 
  0.127<br>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    353/11/1/1/1<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    0.127<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    दुर्गाप्रसाद पिता केलाश<br>जाति चमार<br>पता निवासी ग्राम<br>भूमि स्वामी<br>
    </font></div></td>

  </tr>

 <tr> 
<td align="left" valign="top"><font face="Mangal"> 
  6
  </font></td>
<td height="29" align="left" valign="top"><div align="left"><font face="Mangal"> 
    141/1
    </font> </div></td>
<td align="left" valign="top" bordercolor="#CCCCCC"> 
  0.136<br>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    141/1<br>143/1<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    0.009<br>0.127<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    जीतमल पिता गोरेलाल<br>जाति रूवाला<br>पता निवासी ग्राम<br>भूमि स्वामी<br>
    </font></div></td>

</tr>

 <tr> 
<td align="left" valign="top"><font face="Mangal"> 
  7
  </font></td>
<td height="29" align="left" valign="top"><div align="left"><font face="Mangal"> 
    319/3/1
    </font> </div></td>
<td align="left" valign="top" bordercolor="#CCCCCC"> 
  0.167<br>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    319/3/1<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    0.167<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    शिवनारायण पिता लक्ष्मनीरायण<br>जाति खाती<br>पता निवासी ग्राम<br>भूमि स्वामी<br>
    </font></div></td>

 </tr>
 </table> 

Too long for a comment.

The problem is in your HTML, not with readHTMLTable(...).

First, your table has no header row, so you should probably use header=FALSE in your call to readHTMLTable(...). Treated that way, the table has 7 rows (i.e., 7 instances of <tr> ... </tr>).

Second, in the 6th row, the 4th cell is:

<td>141/1<br>143/1<br></td>

That is, the content of row 6, column 4 consists of two strings separated by <br> line-break tags. Those tags carry no text of their own, so readHTMLTable(...) is correctly returning the concatenated cell text, "141/1143/1".
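
One possible workaround, offered as a minimal sketch rather than a definitive fix, is to turn each <br> into a visible delimiter before parsing and then split the cell text on it. The inline `html` string below is a hypothetical miniature of the table in the question, and the "|" delimiter is an assumption: it only works if that character never occurs in the real cell contents.

```r
library(XML)

# Hypothetical miniature of the table in the question (three rows, four columns).
html <- paste(
  "<table>",
  "<tr><td>2</td><td>214</td><td>0.051<br></td><td>214<br></td></tr>",
  "<tr><td>6</td><td>141/1</td><td>0.136<br></td><td>141/1<br>143/1<br></td></tr>",
  "<tr><td>7</td><td>319/3/1</td><td>0.167<br></td><td>319/3/1<br></td></tr>",
  "</table>",
  sep = "\n"
)

# Turn every <br> (or <br/>) into a visible delimiter before parsing.
# Assumption: "|" never appears in the real cell contents.
marked <- gsub("<br\\s*/?>", "|", html, ignore.case = TRUE, perl = TRUE)

# Parse the modified text; header = FALSE keeps the first row as data.
doc <- htmlParse(marked, asText = TRUE)
tables <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE)
column4 <- as.character(tables[[1]][[4]])

# Split each cell on the delimiter, then drop padding and empty pieces.
cells <- trimws(unlist(strsplit(column4, "|", fixed = TRUE)))
cells <- cells[nzchar(cells)]
print(cells)
```

For a file on disk, the same idea should apply after reading the file into a single string, for example with paste(readLines(path), collapse = "\n"), where `path` stands for your own file name.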


 