
Clean tags from SEC Edgar filings in readtext and quanteda

I am trying to read .txt files into R using readtext and quanteda, which I have parsed from the SEC Edgar database of publicly listed firm filings. An example of the .txt file is here, and a more user-friendly version is here for comparison (PG&E during the Californian wildfires).

My code is the following, for the 1996 folder, which contains many .txt files:

library("readtext")
library("quanteda")

directory <- "D:"
text <- readtext(paste0(directory, "/1996/*.txt"))
# avoid naming objects after the functions corpus() and dfm()
corp <- corpus(text)
dfmat <- dfm(corp, tolower = TRUE, stem = TRUE,
             remove = stopwords("english"), remove_punct = TRUE)

I notice that the dfm still contains a lot of 'useless' tokens, such as 'font-style' and 'italic', and, at the end, many junk tokens such as '3eyn' and 'kq', which I think come from the encoded .jpg at the bottom of the .txt file.

When I specify an encoding in readtext, the problem still persists, for example when doing:

text <- readtext(paste0(directory, "/*.txt"), encoding = "UTF-8")
text <- readtext(paste0(directory, "/*.txt"), encoding = "ASCII")

Any help on how to clean these files so that they look more like the user-friendly version above (i.e. contain only the main text) would be much appreciated.

The key here is to find the marker in the text that indicates the start of the text you want, and the marker that indicates where it ends. This can be a set of alternative conditions separated in the regex using `|`.

Nothing before the first marker is kept (by default), and you can remove the text following the ending marker by dropping that segment from the corpus using corpus_subset(). The actual patterns will no doubt require tweaking once you discover the variety of patterns in your actual data.
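To make the alternation idea concrete, here is a base-R illustration before the quanteda version below; the sample string is invented for demonstration, but the patterns are the ones used for this filing:

```r
# Invented sample text containing both a start and an end marker
doc <- "HEADER boilerplate Item 8.01 Other Events. The disclosure text. SIGNATURES legal boilerplate"

# One regex, two alternative markers separated by |
pattern <- "Item 8\\.01 Other Events\\.|SIGNATURES"

# Both markers are found; corpus_segment() cuts the text at each match
regmatches(doc, gregexpr(pattern, doc))[[1]]
# [1] "Item 8.01 Other Events." "SIGNATURES"
```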

Here's how I did it for your sample document:

library("quanteda")
## Package version: 2.0.0

corp <- readtext::readtext("https://www.sec.gov/Archives/edgar/data/75488/000114036117038612/0001140361-17-038612.txt") %>%
  corpus()

# clean up text
corp <- gsub("<.*?>|&#\\d+;", "", corp)
corp <- gsub("&amp;", "&", corp)

corp <- corpus_segment(corp,
  pattern = "Item 8\\.01 Other Events\\.|SIGNATURES",
  valuetype = "regex"
) %>%
  corpus_subset(pattern != "SIGNATURES")

print(corp, max_nchar = -1)
## Corpus consisting of 1 document and 1 docvar.
## 0001140361-17-038612.txt.1 :
## "Investigation of Northern California Fires   Since October 8, 2017, several catastrophic wildfires have started and remain active in Northern California. The causes of these fires are being investigated by the California Department of Forestry and Fire Protection (Cal Fire), including the possible role of power lines and other facilities of Pacific Gas and Electric Companys (the Utility), a subsidiary of PG&E Corporation.   It currently is unknown whether the Utility would have any liability associated with these fires. The Utility has approximately $800 million in liability insurance for potential losses that may result from these fires. If the amount of insurance is insufficient to cover the Utility's liability or if insurance is otherwise unavailable, PG&E Corporations and the Utilitys financial condition or results of operations could be materially affected."
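If you are cleaning many filings, the gsub() calls above can be wrapped into a small base-R helper applied to each file before building the corpus. The function name and the sample string below are mine, not part of the original answer, and the replacement here uses a space rather than an empty string so adjacent words are not fused:

```r
# Strip HTML/XML tags, numeric character entities, and &amp; from raw filing text
clean_filing <- function(txt) {
  txt <- gsub("<.*?>", " ", txt)        # tags such as <FONT ...> or </TABLE>
  txt <- gsub("&#\\d+;", " ", txt)      # numeric entities such as &#160;
  txt <- gsub("&amp;", "&", txt, fixed = TRUE)
  gsub("\\s+", " ", trimws(txt))        # collapse leftover whitespace
}

clean_filing("<p>PG&amp;E&#160;Corporation</p>")
# [1] "PG&E Corporation"
```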
