繁体   English   中英

如何通过html标签或正则表达式拆分txt文件,以便将其另存为R中的单独txt文件?

[英]How do I split a txt file by html tags or regex in order to save it as separate txt files in R?

我有html和txt格式的LexisNexis批量下载新闻文章的输出。 该文件本身包含几个不同新闻文章的标题,元数据和正文,我需要系统地分离并保存为独立的txt文件。 txt版本的头部看起来像:

> head(textz, 100)
[1] ""                                                                              
[2] "                               1 of 103 DOCUMENTS"                                
[3] ""                                                                                 
[4] ""                                                                                 

[5] "                                Foreign Affairs"                                  

[6] ""                                                                                 
[7] "                              May 2013 - June 2013"                               
[8] ""                                                                                 
[9] "Why the U.S. Army Needs Armor Subtitle: The Case for a Balanced Force"            
[10] ""                                                                                 

[11] "BYLINE: Chris McKinney, Mark Elfendahl, and H. R. McMaster Authors BIOS: CHRIS"   
[12] "MCKINNEY is a Lieutenant Colonel in the U.S. Army and an adviser to the Saudi"    
[13] "Arabian National Guard. MARK ELFENDAHL is a Colonel in the U.S. Army and a"       
[14] "student at the Joint Advanced Warfighting School in Norfolk, Virginia. H. R."     
[15] "MCMASTER is a Major General in the U.S. Army and Commander of the Maneuver"       
[16] "Center of Excellence at Fort Benning, Georgia."                                   

[17] ""                                                                                 

[18] "SECTION: Vol. 92 No. 4 PAGE: 129"                                                 

[19] ""                                                                                 

[20] "LENGTH: 2856 words"                                                               

[21] ""                                                                                 

[22] ""                                                                                 

[23] "Ever since World War II, the United States has depended on armored forces --"     
[24] "forces equipped with tanks and other protected vehicles -- to wage its wars."
....
....

html版本的快照如下所示:

<DOC NUMBER=103>
<DOCFULL> -->
<br><div class="c0">
<p class="c1"><span class="c2">103 of 103 DOCUMENTS</span></p>
</div>
<br><div class="c0">
<br><p class="c1"><span class="c2">The New York Times</span></p>
</div>
<br><div class="c3">
<p class="c1"><span class="c4">July</span>
<span class="c2"> 26, 2011 Tuesday</span>
<span class="c2">Â </span>
<span class="c2">Â <br>Late Edition - Final</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c7">A Step Toward Trust With China</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c8">BYLINE: </span><span class="c2">By MIKE MULLEN. </span></p>
<p class="c9"><span class="c2">Mike Mullen, a </span>
<span class="c4">Navy admiral,</span><span class="c2"> is the chairman of the Joint Chiefs of Staff.
</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c8">SECTION: </span>
<span class="c2">Section A; Column 0; Editorial Desk; OP-ED CONTRIBUTOR; Pg. 23</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c8">LENGTH: </span>
<span class="c2">794 words</span></p>
</div>
<br><div class="c5">
<p class="c9"><span class="c2">Washington</span></p>
<p class="c9"><span class="c2">THE military relationship between the United States and China is one of the world's most important. And yet, clouded by some misunderstanding and suspicion, it remains among the most challenging. There are issues on which we disagree and are tempted to confront each other. But there are crucial areas where our interests coincide, on which we must work together.
</span></p>

独特的文档由每个[0-9] [0-9] DOCUMENTS行分隔,但在grep系列和strsplit之间,我一直无法找到分割txt(或html)文件的方法。 R以一种干净地分离组件文章的方式,允许我将它们保存为独立的txt文件。 彻底搜索其他问题线程无论是无益还是需要使用Python。 任何建议都会很棒!

rvest让它很容易解析HTML。 您的文档与<DOCFULL><DOC NUMBER >标题不完全一致。 下面的答案使用您提供的文档扩展来显示下一个文档(104)。 您可以使用lapply结构执行其他操作,例如每篇文章编写一个文本文件。 请注意html_nodes中的css选择器。 html中似乎没有太多结构,但如果你找到一些模式,你可以用选择器定位每篇文章的位。

library(rvest)
library(stringr)

articles  <- str_replace_all(doc, "\\n", " ") %>%    # remove new line to simplify
  str_replace_all("<DOCFULL>\\s+\\-\\->", " " ) %>%  # remove redundant header
  strsplit("<DOC NUMBER=\\d+>") %>%                  # split on DOC NUMBER header
  unlist()                                           # to a vector

# drop the first empty result form the split
articles <- articles[-1]

# use lapply to travers all articles. 
c2_texts <- lapply(articles, function (article) {
  article %>% 
    read_html() %>%           # character input parsed as html
    html_nodes(css=".c2") %>% # find nodes with CSS selector, ex: c2
    html_text() })            # extract text from within the node

c2_texts
# [[1]]
# [1] "103 of 103 DOCUMENTS"                                                                                                                                                                                                                                                                                                                                                           
# [2] "The New York Times"                                                                                                                                                                                                                                                                                                                                                             
# [3] " 26, 2011 Tuesday"                                                                                                                                                                                                                                                                                                                                                              
# [4] "Â "                                                                                                                                                                                                                                                                                                                                                                             
# [5] "Â Late Edition - Final"                                                                                                                                                                                                                                                                                                                                                         
# [6] "By MIKE MULLEN. "                                                                                                                                                                                                                                                                                                                                                               
# [7] "Mike Mullen, a "                                                                                                                                                                                                                                                                                                                                                                
# [8] " is the chairman of the Joint Chiefs of Staff.     "                                                                                                                                                                                                                                                                                                                            
# [9] "Section A; Column 0; Editorial Desk; OP-ED CONTRIBUTOR; Pg. 23"                                                                                                                                                                                                                                                                                                                 
# [10] "794 words"                                                                                                                                                                                                                                                                                                                                                                      
# [11] "Washington"                                                                                                                                                                                                                                                                                                                                                                     
# [12] "THE military relationship between the United States and China is one of the worlds most important. And yet, clouded by some misunderstanding and suspicion, it remains among the most challenging. There are issues on which we disagree and are tempted to confront each other. But there are crucial areas where our interests coincide, on which we must work together.     "
# 
# [[2]]
# [1] "104 of 104 DOCUMENTS" "The Added Item"      

要拆分txt版本,假设文本在doc_text ,并将每个文本写入顺序命名的文件.txt,file2.txt等。

适用于@P Lapointe改编的文件写作

texts <- unlist(strsplit(doc_text, "\\s+\\d+\\sof\\s\\d+\\sDOCUMENTS") )
texts <- texts[-1]  # drop the first empty split

lapply (1:length(texts), function(i){ write(texts[i], paste0("file", i, ".txt"))})

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM