
How can I filter sentences from a text file using a specific word list in R or Python?

I am struggling to filter EDGAR S-1 financial disclosures in RStudio so that only the sentences containing terms from a specific list are kept.

Sample text from an S-1 filing:

"We run the online operations of our institutions on different platforms, which are in various stages of development. 

The performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. 

Any computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks.

In addition, any significant failure of our computer networks could disrupt our on-campus operations.

Individual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students.

Additionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures.

The disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems.

As a result of any of these events, we may not be able to conduct normal business operations and may be required to incur significant expenses in order to resume normal business operations.

As a result, our revenues and profitability may be materially adversely affected."

An example term list could be something like the following vector.

terms_list = c("institutions", "disaster", "error",...)

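The goal is to edit and overwrite the current text file so that sentences that do not contain any of the specified words or terms, such as those above, are removed.

After filtering and overwriting, the text should look like this.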

"We run the online operations of our institutions on different platforms, which are in various stages of development. 

The performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. 

Any computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks. 

Individual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students. 

Additionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures. 

The disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems. "

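If your data is one long string, in R you can:

  1. Split the string with stringr::str_split
  2. Combine the search terms into a single pattern with paste
  3. Recombine the string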
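Since the question also allows Python, the same three steps can be sketched there as well. This is only a minimal sketch: the function name `filter_paragraphs` and the shortened sample text are made up for illustration, and `re.escape` is an extra precaution so that terms containing regex metacharacters are matched literally.

```python
import re

def filter_paragraphs(text, terms, sep="\n\n"):
    """Keep only the sep-delimited chunks that contain at least one term."""
    # 1. Split the long string into chunks (one sentence per chunk here).
    chunks = text.split(sep)
    # 2. Combine the search terms into one alternation pattern, e.g. "a|b|c".
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    # 3. Keep the matching chunks and recombine them with the same separator.
    kept = [c for c in chunks if pattern.search(c)]
    return sep.join(kept)

# Hypothetical, shortened sample text for illustration.
sample = "Our institutions run online.\n\nNetworks could fail.\n\nA disaster may occur."
print(filter_paragraphs(sample, ["institutions", "disaster", "error"]))
```

The middle chunk contains none of the terms, so it is dropped.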

An example using your data, read in as:

strng <- "We run the online operations of our institutions on different platforms, which are in various stages of development. 

The performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. 

Any computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks.

In addition, any significant failure of our computer networks could disrupt our on-campus operations.

Individual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students.

Additionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures.

The disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems.

As a result of any of these events, we may not be able to conduct normal business operations and may be required to incur significant expenses in order to resume normal business operations.

As a result, our revenues and profitability may be materially adversely affected."

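Here each sentence is separated by \n\n, so we can split the string on that pattern. If your actual data uses a different separator (a period, for example), just substitute that pattern.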
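If the sentences are separated only by punctuation rather than blank lines, one possible approach is to split after sentence-ending punctuation. The Python sketch below is a rough heuristic and not a full sentence tokenizer: it assumes every `.`, `!` or `?` followed by whitespace ends a sentence, so abbreviations such as "U.S. " would be split incorrectly. The name `split_sentences` is hypothetical.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    # The lookbehind keeps the punctuation attached to the preceding sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = ("In addition, any failure could disrupt operations. "
        "As a result, revenues may be affected.")
for sentence in split_sentences(text):
    print(sentence)
```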

strngSplit <- stringr::str_split(strng, "\n\n")[[1]]

# [1] "We run the online operations of our institutions on different platforms, which are in various stages of development. "                                                                                                                                                               
# [2] "The performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. "                                                                                                                      
# [3] "Any computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks."                                                                                           
# [4] "In addition, any significant failure of our computer networks could disrupt our on-campus operations."                                                                                                                                                                               
# [5] "Individual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students."                                                                                                     
# [6] "Additionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures."                        
# [7] "The disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems."
# [8] "As a result of any of these events, we may not be able to conduct normal business operations and may be required to incur significant expenses in order to resume normal business operations."                                                                                       
# [9] "As a result, our revenues and profitability may be materially adversely affected."  

Define the search terms

terms_list <- c("institutions", "disaster", "error")

Find the sentences that contain a search term

idx <- grep(paste0(terms_list, collapse = "|"), strngSplit)
# [1] 1 2 3 5 6 7
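Note that a pattern such as "institutions|disaster|error" matches substrings, so "error" would also match inside "terrorism", and "disaster" matches "disasters". Whether that is desirable depends on the term list. The Python sketch below shows both behaviours side by side; the `whole_words` switch, the function name and the three sample sentences are assumptions for illustration, and the indices are 1-based only to mirror R's output.

```python
import re

def matching_indices(sentences, terms, whole_words=False):
    """Return 1-based positions of the sentences that contain any term."""
    alternation = "|".join(re.escape(t) for t in terms)
    if whole_words:
        # \b anchors each term at word boundaries, so "error" no longer
        # matches inside "terrorism" (and "disaster" not inside "disasters").
        alternation = r"\b(?:" + alternation + r")\b"
    pattern = re.compile(alternation, re.IGNORECASE)
    return [i for i, s in enumerate(sentences, start=1) if pattern.search(s)]

# Hypothetical sentences chosen to show the difference.
sentences = [
    "Any computer system error or failure.",
    "Acts of terrorism are excluded.",
    "Natural disasters may occur.",
]
terms = ["disaster", "error"]
print(matching_indices(sentences, terms))                    # substring matching
print(matching_indices(sentences, terms, whole_words=True))  # whole words only
```

With whole-word matching, plural or inflected forms have to be listed explicitly in the term list.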

You can keep the result as a vector (one sentence per element) or combine it back into a paragraph:

strngVec <- strngSplit[idx]
# [1] "We run the online operations of our institutions on different platforms, which are in various stages of development. "                                                                                                                                                               
# [2] "The performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. "                                                                                                                      
# [3] "Any computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks."                                                                                           
# [4] "Individual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students."                                                                                                     
# [5] "Additionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures."                        
# [6] "The disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems."

# or

strngParagraph <- paste(strngSplit[idx], collapse = "\n\n")
#[1] "We run the online operations of our institutions on different platforms, which are in various stages of development. \n\nThe performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. \n\nAny computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks.\n\nIndividual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students.\n\nAdditionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures.\n\nThe disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems."
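The question also asks for the file itself to be overwritten, which the steps above stop short of. A minimal Python sketch of that last step follows, assuming the file is UTF-8 text with sentences separated by blank lines. The function name `filter_file` and the file name `s1_sample.txt` are hypothetical, and the demo runs on a throwaway temporary file. Because the original content is replaced in place, it is worth keeping a backup copy of real filings.

```python
import re
import tempfile
from pathlib import Path

def filter_file(path, terms, sep="\n\n", encoding="utf-8"):
    """Overwrite path, keeping only the chunks that contain a term.

    Returns the number of chunks that were dropped.
    """
    path = Path(path)
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    chunks = path.read_text(encoding=encoding).split(sep)
    kept = [c for c in chunks if pattern.search(c)]
    path.write_text(sep.join(kept), encoding=encoding)  # replaces the old content
    return len(chunks) - len(kept)

# Demo on a temporary file so that no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "s1_sample.txt"
    demo.write_text("Our institutions run online.\n\nNetworks could fail.",
                    encoding="utf-8")
    dropped = filter_file(demo, ["institutions", "disaster", "error"])
    result = demo.read_text(encoding="utf-8")

print(dropped, repr(result))
```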
