簡體   English   中英

如何通過html標簽或正則表達式拆分txt文件,以便將其另存為R中的單獨txt文件?

[英]How do I split a txt file by html tags or regex in order to save it as separate txt files in R?

我有html和txt格式的LexisNexis批量下載新聞文章的輸出。 該文件本身包含幾個不同新聞文章的標題,元數據和正文,我需要系統地分離並保存為獨立的txt文件。 txt版本的頭部看起來像:

> head(textz, 100)
[1] ""                                                                              
[2] "                               1 of 103 DOCUMENTS"                                
[3] ""                                                                                 
[4] ""                                                                                 

[5] "                                Foreign Affairs"                                  

[6] ""                                                                                 
[7] "                              May 2013 - June 2013"                               
[8] ""                                                                                 
[9] "Why the U.S. Army Needs Armor Subtitle: The Case for a Balanced Force"            
[10] ""                                                                                 

[11] "BYLINE: Chris McKinney, Mark Elfendahl, and H. R. McMaster Authors BIOS: CHRIS"   
[12] "MCKINNEY is a Lieutenant Colonel in the U.S. Army and an adviser to the Saudi"    
[13] "Arabian National Guard. MARK ELFENDAHL is a Colonel in the U.S. Army and a"       
[14] "student at the Joint Advanced Warfighting School in Norfolk, Virginia. H. R."     
[15] "MCMASTER is a Major General in the U.S. Army and Commander of the Maneuver"       
[16] "Center of Excellence at Fort Benning, Georgia."                                   

[17] ""                                                                                 

[18] "SECTION: Vol. 92 No. 4 PAGE: 129"                                                 

[19] ""                                                                                 

[20] "LENGTH: 2856 words"                                                               

[21] ""                                                                                 

[22] ""                                                                                 

[23] "Ever since World War II, the United States has depended on armored forces --"     
[24] "forces equipped with tanks and other protected vehicles -- to wage its wars."
....
....

html版本的快照如下所示:

<DOC NUMBER=103>
<DOCFULL> -->
<br><div class="c0">
<p class="c1"><span class="c2">103 of 103 DOCUMENTS</span></p>
</div>
<br><div class="c0">
<br><p class="c1"><span class="c2">The New York Times</span></p>
</div>
<br><div class="c3">
<p class="c1"><span class="c4">July</span>
<span class="c2"> 26, 2011 Tuesday</span>
<span class="c2">Â </span>
<span class="c2">Â <br>Late Edition - Final</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c7">A Step Toward Trust With China</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c8">BYLINE: </span><span class="c2">By MIKE MULLEN. </span></p>
<p class="c9"><span class="c2">Mike Mullen, a </span>
<span class="c4">Navy admiral,</span><span class="c2"> is the chairman of the Joint Chiefs of Staff.
</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c8">SECTION: </span>
<span class="c2">Section A; Column 0; Editorial Desk; OP-ED CONTRIBUTOR; Pg. 23</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c8">LENGTH: </span>
<span class="c2">794 words</span></p>
</div>
<br><div class="c5">
<p class="c9"><span class="c2">Washington</span></p>
<p class="c9"><span class="c2">THE military relationship between the United States and China is one of the world's most important. And yet, clouded by some misunderstanding and suspicion, it remains among the most challenging. There are issues on which we disagree and are tempted to confront each other. But there are crucial areas where our interests coincide, on which we must work together.
</span></p>

獨特的文檔由每個[0-9] [0-9] DOCUMENTS行分隔,但在grep系列和strsplit之間,我一直無法找到分割txt(或html)文件的方法。 R以一種干凈地分離組件文章的方式,允許我將它們保存為獨立的txt文件。 徹底搜索其他問題線程無論是無益還是需要使用Python。 任何建議都會很棒!

rvest讓它很容易解析HTML。 您的文檔與<DOCFULL><DOC NUMBER >標題不完全一致。 下面的答案使用您提供的文檔擴展來顯示下一個文檔(104)。 您可以使用lapply結構執行其他操作,例如每篇文章編寫一個文本文件。 請注意html_nodes中的css選擇器。 html中似乎沒有太多結構,但如果你找到一些模式,你可以用選擇器定位每篇文章的位。

library(rvest)
library(stringr)

articles  <- str_replace_all(doc, "\\n", " ") %>%    # remove new line to simplify
  str_replace_all("<DOCFULL>\\s+\\-\\->", " " ) %>%  # remove redundant header
  strsplit("<DOC NUMBER=\\d+>") %>%                  # split on DOC NUMBER header
  unlist()                                           # to a vector

# drop the first empty result form the split
articles <- articles[-1]

# use lapply to travers all articles. 
c2_texts <- lapply(articles, function (article) {
  article %>% 
    read_html() %>%           # character input parsed as html
    html_nodes(css=".c2") %>% # find nodes with CSS selector, ex: c2
    html_text() })            # extract text from within the node

c2_texts
# [[1]]
# [1] "103 of 103 DOCUMENTS"                                                                                                                                                                                                                                                                                                                                                           
# [2] "The New York Times"                                                                                                                                                                                                                                                                                                                                                             
# [3] " 26, 2011 Tuesday"                                                                                                                                                                                                                                                                                                                                                              
# [4] "Â "                                                                                                                                                                                                                                                                                                                                                                             
# [5] "Â Late Edition - Final"                                                                                                                                                                                                                                                                                                                                                         
# [6] "By MIKE MULLEN. "                                                                                                                                                                                                                                                                                                                                                               
# [7] "Mike Mullen, a "                                                                                                                                                                                                                                                                                                                                                                
# [8] " is the chairman of the Joint Chiefs of Staff.     "                                                                                                                                                                                                                                                                                                                            
# [9] "Section A; Column 0; Editorial Desk; OP-ED CONTRIBUTOR; Pg. 23"                                                                                                                                                                                                                                                                                                                 
# [10] "794 words"                                                                                                                                                                                                                                                                                                                                                                      
# [11] "Washington"                                                                                                                                                                                                                                                                                                                                                                     
# [12] "THE military relationship between the United States and China is one of the worlds most important. And yet, clouded by some misunderstanding and suspicion, it remains among the most challenging. There are issues on which we disagree and are tempted to confront each other. But there are crucial areas where our interests coincide, on which we must work together.     "
# 
# [[2]]
# [1] "104 of 104 DOCUMENTS" "The Added Item"      

要拆分txt版本,假設文本在doc_text ,並將每個文本寫入順序命名的文件.txt,file2.txt等。

適用於@P Lapointe改編的文件寫作

texts <- unlist(strsplit(doc_text, "\\s+\\d+\\sof\\s\\d+\\sDOCUMENTS") )
texts <- texts[-1]  # drop the first empty split

lapply (1:length(texts), function(i){ write(texts[i], paste0("file", i, ".txt"))})

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM