How can I filter sentences in a text file by a specific word list in R or Python?

I am struggling to correctly filter sentences from EDGAR S-1 financial disclosures by a specific list of terms in RStudio.

Sample text from an S-1 filing:

"We run the online operations of our institutions on different platforms, which are in various stages of development. 

The performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. 

Any computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks.

In addition, any significant failure of our computer networks could disrupt our on-campus operations.

Individual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students.

Additionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures.

The disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems.

As a result of any of these events, we may not be able to conduct normal business operations and may be required to incur significant expenses in order to resume normal business operations.

As a result, our revenues and profitability may be materially adversely affected."

An example term list could be something like the following vector.

terms_list = c("institutions", "disaster", "error",...)

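The goal is to edit and overwrite the current text file so that sentences that do not contain any of the specified words or terms, such as those above, are removed.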

After filtering and overwriting, the text should look like this.

"We run the online operations of our institutions on different platforms, which are in various stages of development. 

The performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. 

Any computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks. 

Individual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students. 

Additionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures. 

The disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems. "

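If your data is one long string, in R you can:

  1. Split the string with stringr::str_split
  2. Combine the search terms into a single pattern with paste and find the matching sentences with grep
  3. Recombine the string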

An example using your data, read in as:

strng <- "We run the online operations of our institutions on different platforms, which are in various stages of development. 

The performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. 

Any computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks.

In addition, any significant failure of our computer networks could disrupt our on-campus operations.

Individual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students.

Additionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures.

The disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems.

As a result of any of these events, we may not be able to conduct normal business operations and may be required to incur significant expenses in order to resume normal business operations.

As a result, our revenues and profitability may be materially adversely affected."

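Here each sentence is separated by \n\n, so we can split the string on that pattern. If your actual data uses a different delimiter (a period, for example), just substitute it.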

# Split on blank lines; "\n\n" is already a valid regex for two newlines
strngSplit <- stringr::str_split(strng, "\n\n")[[1]]

# [1] "We run the online operations of our institutions on different platforms, which are in various stages of development. "                                                                                                                                                               
# [2] "The performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. "                                                                                                                      
# [3] "Any computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks."                                                                                           
# [4] "In addition, any significant failure of our computer networks could disrupt our on-campus operations."                                                                                                                                                                               
# [5] "Individual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students."                                                                                                     
# [6] "Additionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures."                        
# [7] "The disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems."
# [8] "As a result of any of these events, we may not be able to conduct normal business operations and may be required to incur significant expenses in order to resume normal business operations."                                                                                       
# [9] "As a result, our revenues and profitability may be materially adversely affected."  
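Since the question also mentions Python, the splitting step can be sketched there as well. This is a minimal illustration, not part of the original answer: the shortened `text` is made up for the example, and the period-based split is a naive assumption that would also break on abbreviations such as "Inc." or "U.S.".

```python
import re

# Shortened, made-up stand-in for the S-1 excerpt (an assumption for illustration).
text = (
    "We run the online operations of our institutions. \n\n"
    "Any computer system error may occur.\n\n"
    "As a result, our revenues may be affected."
)

# Split on blank lines, mirroring the "\n\n" delimiter used in the R code.
# Unlike the R output, surrounding whitespace is stripped from each piece.
by_blank_line = [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]

# Alternative: split after sentence-ending punctuation followed by whitespace.
# Naive on purpose: abbreviations ending in a period would be split too.
by_period = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
```

For this sample both approaches yield the same three sentences; on real filings the two can differ, so it is worth checking which delimiter the text actually uses.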

Define the search terms:

terms_list <- c("institutions", "disaster", "error")

Find the sentences that contain the search terms:

idx <- grep(paste0(terms_list, collapse = "|"), strngSplit)
# [1] 1 2 3 5 6 7
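One thing to be aware of is that joining the terms with "|" matches substrings, so "error" would also match inside "terrorism". The Python sketch below contrasts plain alternation with whole-word matching via `\b`. The three sample sentences are hypothetical, and whether whole-word matching is desirable depends on the data, since it also stops "disaster" from matching the plural "disasters".

```python
import re

terms_list = ["institutions", "disaster", "error"]

# Plain alternation: the Python counterpart of paste0(terms_list, collapse = "|").
loose = re.compile("|".join(re.escape(t) for t in terms_list))

# The same terms wrapped in \b so that only whole words match.
strict = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in terms_list) + r")\b"
)

# Hypothetical sentences chosen to show the difference.
sentences = [
    "Any computer system error or failure may occur.",
    "Acts of terrorism are excluded.",            # contains "error" as a substring
    "Natural disasters are beyond our control.",  # plural of "disaster"
]

loose_idx = [i for i, s in enumerate(sentences) if loose.search(s)]
strict_idx = [i for i, s in enumerate(sentences) if strict.search(s)]
```

Here `loose_idx` covers all three sentences while `strict_idx` keeps only the first, so adding plural forms to the term list may be needed if whole-word matching is used.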

You can keep the result as a vector (one sentence per element) or combine it back into a paragraph:

strngVec <- strngSplit[idx]
# [1] "We run the online operations of our institutions on different platforms, which are in various stages of development. "                                                                                                                                                               
# [2] "The performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. "                                                                                                                      
# [3] "Any computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks."                                                                                           
# [4] "Individual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students."                                                                                                     
# [5] "Additionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures."                        
# [6] "The disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems."

# or

strngParagraph <- paste(strngSplit[idx], collapse = "\n\n")
#[1] "We run the online operations of our institutions on different platforms, which are in various stages of development. \n\nThe performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. \n\nAny computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks.\n\nIndividual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students.\n\nAdditionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures.\n\nThe disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems."
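The question also asks about Python and about overwriting the file, which the R steps above do not cover. A minimal Python sketch of the same split, filter, and recombine idea, followed by writing the result back, might look like this. The function names, the file name `s1_excerpt.txt`, and the short demo text are all hypothetical, and it assumes sentences are separated by blank lines as in the example.

```python
import re
import tempfile
from pathlib import Path


def filter_sentences(text, terms, sep="\n\n"):
    """Keep only the sep-delimited sentences that contain at least one term."""
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    kept = [s for s in text.split(sep) if pattern.search(s)]
    return sep.join(kept)


def filter_file_in_place(path, terms, sep="\n\n"):
    """Read the file, filter its sentences, and overwrite it with the result."""
    p = Path(path)
    filtered = filter_sentences(p.read_text(encoding="utf-8"), terms, sep)
    p.write_text(filtered, encoding="utf-8")  # destructive: replaces the original
    return filtered


# Demo on a temporary file so no real data is touched (made-up text).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "s1_excerpt.txt"  # hypothetical file name
    demo_path.write_text(
        "Our institutions run online.\n\n"
        "Networks could fail.\n\n"
        "A natural disaster may occur.",
        encoding="utf-8",
    )
    result = filter_file_in_place(demo_path, ["institutions", "disaster", "error"])
    on_disk = demo_path.read_text(encoding="utf-8")
```

Because the file is overwritten, keeping a copy of the original filing before running something like this is probably wise.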
