
How do I import a txt file into R and separate text into columns based on certain criteria

I have some job descriptions saved in a txt file. The job title, job description, etc. are all lumped together, and I am trying to separate them into columns. The text is about 5 pages long. Here is a sample of how the text is structured:

EXECUTIVE LEVEL
001 Chief Executive Officer: Job description of CEO.
040 Area Director: This line contains job description of the Area Director.

FINANCE TEAM
025 Chief Operating Officer: This line contains job description of the Chief Operating Officer
055 Chief Financial Officer: This person controls operations of the company and reports to the COO

MARKETING TEAM
056 Marketing Director: This person is in charge of the marketing team. Blab la bla

I would like to create a data frame (or is it called a tibble these days?) with 4 columns:

column 1 - The team name (Executive Level, Finance Team, Marketing Team, etc)

column 2 - Team number (001, 040, 025, 055, etc)

column 3 - The job title (Chief Executive Officer, Chief Operating Officer, etc)

column 4 - The job description

Thanks in advance

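# Drop empty lines
x2 <- x[nzchar(x)]
# Start a new group at every line that begins with a capital letter (the team headers)
x3 <- split(x2, cumsum(grepl("^[A-Z]", x2)))
# In each group the first line is the team name; parse the remaining lines into number, title and description
x4 <- lapply(x3, function(z) transform(strcapture("^([0-9]+)\\s+([^:]+):\\s*(.*)$", z[-1], list(num="", title="", desc="")), name=z[1]))
# Stack the per-team data frames into one
x5 <- do.call(rbind, x4)
x5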
#     num                   title                                                                  desc            name
# 1.1 001 Chief Executive Officer                                               Job description of CEO. EXECUTIVE LEVEL
# 1.2 040           Area Director              This line contains job description of the Area Director. EXECUTIVE LEVEL
# 2.1 025 Chief Operating Officer     This line contains job description of the Chief Operating Officer    FINANCE TEAM
# 2.2 055 Chief Financial Officer This person controls operations of the company and reports to the COO    FINANCE TEAM
# 3   056      Marketing Director           This person is in charge of the marketing team. Blab la bla  MARKETING TEAM
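
To see what the pattern passed to strcapture() does, it can help to run it on a single line. This is only a minimal sketch using one line from the sample text; the names `line` and `parts` are made up for illustration, and a data.frame prototype is used here (instead of the list above) so that the result is a data frame straight away.

```r
# One detail line from the sample text
line <- "001 Chief Executive Officer: Job description of CEO."

# ^([0-9]+)  leading digits                  -> num
# \\s+       whitespace after the number
# ([^:]+)    everything up to the first colon -> title
# :\\s*      the colon and any spaces after it
# (.*)$      the rest of the line             -> desc
parts <- strcapture(
  "^([0-9]+)\\s+([^:]+):\\s*(.*)$",
  line,
  proto = data.frame(num = "", title = "", desc = "")
)
str(parts)
```

Because the prototype columns are character, the number keeps its leading zeros instead of being converted to a numeric value.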

Data, likely the result of x <- readLines(path_to_file):

x <- c("EXECUTIVE LEVEL", "001 Chief Executive Officer: Job description of CEO.", "040 Area Director: This line contains job description of the Area Director.", "", "FINANCE TEAM", "025 Chief Operating Officer: This line contains job description of the Chief Operating Officer", "055 Chief Financial Officer: This person controls operations of the company and reports to the COO", "", "MARKETING TEAM", "056 Marketing Director: This person is in charge of the marketing team. Blab la bla")
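
For completeness, here is a rough end-to-end sketch that starts from a file and puts the columns in the order the question asks for (team, number, title, description). It is a variation, not the exact code above: it treats every line that does not start with a digit as a team header, and it assumes the file begins with a header line. The temporary file and the names `is_header`, `team` and `jobs` are only for illustration.

```r
# Write the sample text to a temporary file so the sketch runs on its own;
# in practice path_to_file would point at the real txt file.
path_to_file <- tempfile(fileext = ".txt")
writeLines(c(
  "EXECUTIVE LEVEL",
  "001 Chief Executive Officer: Job description of CEO.",
  "040 Area Director: This line contains job description of the Area Director.",
  "",
  "FINANCE TEAM",
  "025 Chief Operating Officer: This line contains job description of the Chief Operating Officer",
  "055 Chief Financial Officer: This person controls operations of the company and reports to the COO",
  "",
  "MARKETING TEAM",
  "056 Marketing Director: This person is in charge of the marketing team. Blab la bla"
), path_to_file)

x <- readLines(path_to_file)
x <- x[nzchar(x)]                      # drop empty lines

# Assumption: header lines are the ones that do not start with a digit
is_header <- !grepl("^[0-9]", x)
# Carry each header forward onto the lines that follow it
team <- x[is_header][cumsum(is_header)]

# Parse the detail lines into number, title and description
jobs <- strcapture(
  "^([0-9]+)\\s+([^:]+):\\s*(.*)$",
  x[!is_header],
  proto = data.frame(number = "", title = "", description = "")
)

# Put the team name first, as requested in the question
jobs <- cbind(team = team[!is_header], jobs)
jobs
```

If a tibble is preferred, the result could presumably be converted afterwards with tibble::as_tibble(jobs), assuming the tibble package is installed.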
