
Extracting html tables to R

I have a list of static html files which contain tables. I need to extract one particular column (the 4th column) from that table. I am using R to extract the tables from the html, but the rows seem to get merged for that column (it seems like the readHTMLTable API does not take the <br> tags in the html into account). Any help will be appreciated.

This is my R code:

library('XML')
table<-readHTMLTable("C:\\Desktop\\TEST\\140.html")
n.rows <- unlist(lapply(table, function(t) dim(t)[1]))
output <- table[[which.max(n.rows)]]
output[,4] 

Output:

[1] 214                                                                               
[2] 321/2/1                                                                           
[3] 321/5                                                                             
[4] 353/11/1/1/1                                                                      
[5] 141/1143/1 
[6] 319/3/1                 

The ideal output would be:

[1] 214                                                                               
[2] 321/2/1                                                                           
[3] 321/5                                                                             
[4] 353/11/1/1/1                                                                      
[5] 141/1
[6] 143/1 
[7] 319/3/1     

The 6th table row just gets merged: 141/1 and 143/1 come out as the single value 141/1143/1.

A sample of the html file is as follows:

  <table>
<tr> 
<td align="left" valign="top"><font face="Mangal"> 
  1
  </font></td>
<td height="29" align="left" valign="top"><div align="left"><font face="Mangal"> 
    86/2/5
    </font> </div></td>
<td align="left" valign="top" bordercolor="#CCCCCC"> 
  0.036<br>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    86/2/5<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    0.036<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    बिहारीलाल किसनलाल जुझारसिंह छगनलाल पिता भागमल<br>जाति रूवाला<br>पता निवासी ग्राम<br>भूमि स्वामी<br>
    </font></div></td>

 </tr>

  <tr> 
 <td align="left" valign="top"><font face="Mangal"> 
  2
  </font></td>
 <td height="29" align="left" valign="top"><div align="left"><font face="Mangal"> 
    214
    </font> </div></td>
 <td align="left" valign="top" bordercolor="#CCCCCC"> 
  0.051<br>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div></td>
 <td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    214<br>
    </font></div></td>
 <td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    0.051<br>
    </font></div></td>
 <td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    गंगाराम पिता किशना<br>जाति कुमहार<br>पता नि.चोरखेडी<br>भूमि स्वामी<br>
    </font></div></td>

  </tr>

  <tr> 
  <td align="left" valign="top"><font face="Mangal"> 
  3
  </font></td>
<td height="29" align="left" valign="top"><div align="left"><font face="Mangal"> 
    321/2/1
    </font> </div></td>
<td align="left" valign="top" bordercolor="#CCCCCC"> 
  0.063<br>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    321/2/1<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    0.063<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    गंगाराम पिता घीसा<br>जाति खाती<br>पता निवासी ग्राम<br>भूमि स्वामी<br>
    </font></div></td>

  </tr>

  <tr> 
<td align="left" valign="top"><font face="Mangal"> 
  4
  </font></td>
<td height="29" align="left" valign="top"><div align="left"><font face="Mangal"> 
    321/5
    </font> </div></td>
<td align="left" valign="top" bordercolor="#CCCCCC"> 
  0.063<br>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    321/5<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    0.063<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    रामनारायण पिता घीसालाल<br>जाति खाती<br>पता निवासी ग्राम<br>भूमि स्वामी<br>
    </font></div></td>

 </tr>

 <tr> 
 <td align="left" valign="top"><font face="Mangal"> 
  5
  </font></td>
<td height="29" align="left" valign="top"><div align="left"><font face="Mangal"> 
    353/11/1/1/1
    </font> </div></td>
<td align="left" valign="top" bordercolor="#CCCCCC"> 
  0.127<br>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    353/11/1/1/1<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    0.127<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    दुर्गाप्रसाद पिता केलाश<br>जाति चमार<br>पता निवासी ग्राम<br>भूमि स्वामी<br>
    </font></div></td>

  </tr>

 <tr> 
<td align="left" valign="top"><font face="Mangal"> 
  6
  </font></td>
<td height="29" align="left" valign="top"><div align="left"><font face="Mangal"> 
    141/1
    </font> </div></td>
<td align="left" valign="top" bordercolor="#CCCCCC"> 
  0.136<br>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    141/1<br>143/1<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    0.009<br>0.127<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    जीतमल पिता गोरेलाल<br>जाति रूवाला<br>पता निवासी ग्राम<br>भूमि स्वामी<br>
    </font></div></td>

</tr>

 <tr> 
<td align="left" valign="top"><font face="Mangal"> 
  7
  </font></td>
<td height="29" align="left" valign="top"><div align="left"><font face="Mangal"> 
    319/3/1
    </font> </div></td>
<td align="left" valign="top" bordercolor="#CCCCCC"> 
  0.167<br>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div>
  <div align="left"></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    319/3/1<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    0.167<br>
    </font></div></td>
<td align="left" valign="top"> <div align="left"><font face="Mangal"> 
    शिवनारायण पिता लक्ष्मनीरायण<br>जाति खाती<br>पता निवासी ग्राम<br>भूमि स्वामी<br>
    </font></div></td>

 </tr>
 </table> 

Too long for a comment.

The problem is in your HTML, not with readHTMLTable(...).

First, your table has no header row, so you should probably use header=F in your call to readHTMLTable(...). Treated that way, the table has 7 rows (i.e., 7 instances of <tr> ... </tr>).
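The output in the question starts at 214 rather than 86/2/5, which suggests the first row was consumed as a header. A minimal sketch of the header=F call follows, using a two-row inline table cut down from the sample as a stand-in for the real file (the inline string and the variable names are assumptions for illustration):

```r
library(XML)

# Hypothetical two-row stand-in for the real file, cut down from the sample.
html <- "<table>
<tr><td>1</td><td>86/2/5</td></tr>
<tr><td>2</td><td>214</td></tr>
</table>"

doc <- htmlParse(html, asText = TRUE)

# header = FALSE keeps the first <tr> as data instead of using it for column names.
no_header <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE)[[1]]

nrow(no_header)   # one data row per <tr>
no_header[, 2]
```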

Second, in the 6th row, the 4th cell is:

<td>141/1<br>143/1<br></td>

That is, the contents of row 6, column 4 consist of two strings separated by line break tags (<br>), all inside a single cell. readHTMLTable(...) returns the text of the cell with the tags dropped, so it is correctly converting it to "141/1143/1".
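If the goal is one value per <br>-separated entry, one possible workaround (a sketch, not the only approach) is to replace each <br> with a delimiter before parsing and then split the 4th column on it. The "|" delimiter is an assumption that this character never occurs in the data, and the inline HTML is a cut-down stand-in for the real file:

```r
library(XML)

# Hypothetical stand-in for the file contents. For a real file, something like
# paste(readLines(path, warn = FALSE), collapse = "\n") could supply this string.
html <- "<table>
<tr><td>1</td><td>86/2/5</td><td>0.036<br></td><td>86/2/5<br></td></tr>
<tr><td>6</td><td>141/1</td><td>0.136<br></td><td>141/1<br>143/1<br></td></tr>
<tr><td>7</td><td>319/3/1</td><td>0.167<br></td><td>319/3/1<br></td></tr>
</table>"

# Turn every <br> into a visible delimiter so it survives text extraction.
marked <- gsub("<br\\s*/?>", "|", html, ignore.case = TRUE, perl = TRUE)

doc <- htmlParse(marked, asText = TRUE)
tables <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE)

# As in the question: keep the table with the most rows.
n.rows <- vapply(tables, nrow, integer(1))
output <- tables[[which.max(n.rows)]]

# Split column 4 on the delimiter, then drop whitespace and empty pieces.
pieces <- unlist(strsplit(as.character(output[, 4]), "|", fixed = TRUE))
pieces <- trimws(pieces)
pieces <- pieces[nzchar(pieces)]
pieces
```

Note that splitting this way flattens the column into a plain vector, so the values no longer line up one-to-one with the table rows.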
