How do I remove the first few characters from a column in R?

Question

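My data (a CSV file) has a column whose values begin with meaningless characters (e.g. special characters, random lowercase letters), and I want to remove them.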

df <- data.frame(Affiliation = c(". Biotechnology Centre, Malaysia Agricultural Research and Development Institute (MARDI), Serdang, Malaysia","**Institute for Research in Molecular Medicine (INFORMM), Universiti Sains Malaysia, Pulau Pinang, Malaysia","aas Massachusetts General Hospital and Harvard Medical School, Center for Human Genetic Research and Department of Neurology , Boston , MA , USA","ac Albert Einstein College of Medicine , Department of Pathology , Bronx , NY , USA"))

The number of characters I want to remove from each row (e.g. ".", "**", "aas", "ac") varies, as shown above.

Expected output:

df <- data.frame(Affiliation = c("Biotechnology Centre, Malaysia Agricultural Research and Development Institute (MARDI), Serdang, Malaysia","Institute for Research in Molecular Medicine (INFORMM), Universiti Sains Malaysia, Pulau Pinang, Malaysia","Massachusetts General Hospital and Harvard Medical School, Center for Human Genetic Research and Department of Neurology , Boston , MA , USA","Albert Einstein College of Medicine , Department of Pathology , Bronx , NY , USA"))

I am thinking of using dplyr's mutate function, but I'm not sure how to go about it.

Answer 1

If we assume the valid text starts at the first uppercase letter, the following works:

library(tidyverse)
df %>% 
  mutate(Affiliation = str_extract(Affiliation, "[:upper:].+"))
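For readers who prefer not to load the tidyverse, the same assumption (valid text starts at the first uppercase letter) can be sketched in base R. This is an illustrative equivalent rather than part of the original answer; the shortened sample strings and the name `cleaned` are made up for the example.

```r
# Shortened sample data (hypothetical, modelled on the question's rows)
df <- data.frame(Affiliation = c(
  ". Biotechnology Centre, Malaysia",
  "**Institute for Research in Molecular Medicine (INFORMM)",
  "aas Massachusetts General Hospital",
  "ac Albert Einstein College of Medicine"
))

# "^[^[:upper:]]*" matches the leading run of non-uppercase characters,
# so replacing it with "" keeps everything from the first uppercase letter on.
cleaned <- sub("^[^[:upper:]]*", "", df$Affiliation)
print(cleaned)
```

One difference to keep in mind: for a string with no uppercase letter at all, `str_extract` returns `NA`, whereas this `sub` call returns an empty string.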

Answer 2

Base R regex solution:

df$cleaned_str <- gsub("^\\w+ |^\\*+|^\\. ", "", df$Affiliation)

Tidyverse regex solution:

library(tidyverse)
df %>% 
  mutate(Affiliation = str_replace(Affiliation, "^\\w+ |^\\*+|^\\. ", ""))
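One caveat with this pattern: `^\\w+ ` removes the first word of any string, including a row that is already clean. The sketch below shows that behaviour and a possible stricter variant that only strips an all-lowercase prefix. The stricter pattern is an assumption about what counts as junk and is not part of the original answer; `x`, `stricter`, and the result names are hypothetical.

```r
# One dirty row and one row that is already clean (hypothetical sample)
x <- c("ac Albert Einstein College of Medicine", "Harvard Medical School")

# Pattern from the answer: the first alternative matches any leading word
pattern <- "^\\w+ |^\\*+|^\\. "
original_result <- sub(pattern, "", x)
print(original_result)  # the clean row loses "Harvard "

# Stricter variant (assumption): only an all-lowercase leading word is junk
stricter <- "^[[:lower:]]+ |^\\*+|^\\. "
stricter_result <- sub(stricter, "", x)
print(stricter_result)  # the clean row is left untouched
```

Since every alternative is anchored with `^`, `sub` is enough here; `gsub` gives the same result.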
