
Extracting text from a paragraph between two symbols in R

I have a paragraph of text and I would like to extract every sample size from it. I can usually get a regex to work, but this time I cannot.

Here is an example:

x = "OBJECTIVES:

In diabetic polyneuropathy (DPN) patients, the effect of folic acid and homocysteine has been related to components of nerve conduction velocity (NCV). The objective of this study was to determine the effect of folic acid supplementation on NCV in DPN patients.
METHODS:

Patients were randomized to receive either 1 mg of folic acid (n = 40) or placebo (n = 40) for 16 weeks. Blood samples were collected to assess serum folic acid and homocysteine concentrations, and NCV was performed for assessment of diabetic neuropathy.
RESULTS:

At 16 weeks, in the supplemented group, serum levels of folic acid (p < 0.001) increased, homocysteine concentrations decreased (p < 0.001), with no change in serum vitamin B12 levels. There was a significant increase in sensory sural amplitude (p < 0.001), and components of motor nerves, including amplitude (p = 0.001) and velocity (p < 0.001), but decreased onset latency of peroneal (p = 0.019) and tibial (p = 0.011) motor nerves.
CONCLUSION:

Our data suggest that supplementation with 1 mg of folic acid for 16 weeks may be useful for enhancing NCV in DPN patients."

I would like to extract the two sample sizes. In this case, n = 40 and n = 40.

I have tried

gsub('.*[n=]|).*','',x)

I get back "ts.".

Here's one way to extract those values:

regmatches(x, gregexpr('n\\s*=\\s*\\d+',x))

Here we look for n= (with possible whitespace around the equals sign) and then extract the matches with regmatches.
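If the goal is the numbers themselves rather than the "n = 40" strings, the matches can be stripped down and converted. This is only a sketch: txt is a shortened stand-in for the abstract in the question, and the names matches and sizes are made up for illustration.

```r
# Shortened stand-in for the abstract in the question (an assumption for brevity)
txt <- "folic acid (n = 40) or placebo (n = 40) for 16 weeks (p = 0.011)"

# Pull out every "n = <digits>" occurrence, as in the answer above
matches <- regmatches(txt, gregexpr("n\\s*=\\s*\\d+", txt))[[1]]

# Drop the "n = " prefix and convert what is left to integers
sizes <- as.integer(sub("n\\s*=\\s*", "", matches))
sizes
```

Note that the pattern would also match an "n" at the end of a longer word followed by "=", so it may need a word boundary for messier text.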

An ugly solution without regex could be the following (it assumes every sample size has exactly two digits):

# split once on the literal string "n = "
parts <- strsplit(x, "n = ", fixed = TRUE)[[1]]
# first "n = ": take the two characters that follow it
substr(parts[2], 1, 2)
# second "n = "
substr(parts[3], 1, 2)

You could use stringr to extract "n = " followed by at least one digit. This assumes there is either no space or exactly one space on each side of the equals sign:

library(stringr)
str_extract_all(x, "n\\s?\\=\\s?\\d+")
[[1]]
[1] "n = 40" "n = 40"

EDIT: The following should work inside mutate with your other condition. I switched from stringr to stringi to get NA for a string with no matches. You could also use paste instead of stri_flatten, but I would stick with stri_flatten because it retains NA as a missing value rather than turning it into the character string "NA" as paste does.

library(stringi)
sapply(stri_extract_all(x, regex = "n\\s?\\=\\s?\\d+"), stri_flatten, collapse = ", ")

For regex I started with this cheat sheet for R (and still reference it). The above regex works like so:

n - the letter n

\\s? - at most one (the ?) whitespace character (\\s) (you may prefer MrFlick's use of * over the ?; your call)

\\= - equals sign

\\s? - at most one (the ?) whitespace character (\\s)

\\d+ - one or more (the +) digits (\\d)
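The practical difference between ? and * only shows up when more than one whitespace character surrounds the equals sign. A small sketch, using made-up test strings that are not part of the question's data:

```r
# Hypothetical inputs: no spaces, single spaces, double spaces
s <- c("n=40", "n = 40", "n  =  40")

# ? allows at most one whitespace character on each side
with_optional <- grepl("n\\s?=\\s?\\d+", s)

# * allows any number of whitespace characters, including none
with_star <- grepl("n\\s*=\\s*\\d+", s)

with_optional
with_star
```

Only the double-spaced string is treated differently: it is rejected by the ? version and accepted by the * version.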

Data:

x = c("OBJECTIVES:

In diabetic polyneuropathy (DPN) patients, the effect of folic acid and homocysteine has been related to components of nerve conduction velocity (NCV). The objective of this study was to determine the effect of folic acid supplementation on NCV in DPN patients.
METHODS:

Patients were randomized to receive either 1 mg of folic acid (n = 40) or placebo (n = 40) for 16 weeks. Blood samples were collected to assess serum folic acid and homocysteine concentrations, and NCV was performed for assessment of diabetic neuropathy.
RESULTS:

At 16 weeks, in the supplemented group, serum levels of folic acid (p < 0.001) increased, homocysteine concentrations decreased (p < 0.001), with no change in serum vitamin B12 levels. There was a significant increase in sensory sural amplitude (p < 0.001), and components of motor nerves, including amplitude (p = 0.001) and velocity (p < 0.001), but decreased onset latency of peroneal (p = 0.019) and tibial (p = 0.011) motor nerves.
CONCLUSION:

Our data suggest that supplementation with 1 mg of folic acid for 16 weeks may be useful for enhancing NCV in DPN patients.", "no numbers here", "n = 100")
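If adding stringi is not an option, roughly the same "NA when nothing matches" behaviour can be sketched in base R. The helper name extract_n and the short input vector are assumptions made for illustration; this is not part of the answer above.

```r
# Hypothetical helper: collapse all "n = <digits>" matches per string,
# returning a real NA (not the string "NA") when there is no match
extract_n <- function(txt) {
  hits <- regmatches(txt, gregexpr("n\\s?=\\s?\\d+", txt))
  vapply(
    hits,
    function(h) if (length(h) == 0) NA_character_ else paste(h, collapse = ", "),
    character(1)
  )
}

# Shortened stand-ins for the three elements of the data above
res <- extract_n(c("folic acid (n = 40) or placebo (n = 40)", "no numbers here", "n = 100"))
res
```

The explicit length check is there because paste on an empty match would give an empty string instead of a missing value.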

One way to retrieve the text between parentheses is this:

library(stringr)

lapply(str_split(x,pattern="\\("),function(x) gsub('(.*)\\).*','\\1',x))
[[1]]
 [1] "OBJECTIVES:\n\nIn diabetic polyneuropathy "
 [2] "DPN"                                       
 [3] "NCV"                                       
 [4] "n = 40"                                    
 [5] "n = 40"                                    
 [6] "p < 0.001"                                 
 [7] "p < 0.001"                                 
 [8] "p < 0.001"                                 
 [9] "p = 0.001"                                 
[10] "p < 0.001"                                 
[11] "p = 0.019"                                 
[12] "p = 0.011"    

You split the text using \\( as the pattern and apply gsub to each piece. Afterwards you can use grep to identify which elements start with "n =" and retrieve the ones you need.

I hope it helps.
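The final grep step might look like the sketch below. It uses base strsplit instead of str_split so that it runs without extra packages, and txt is a shortened stand-in for the abstract; both are assumptions for illustration.

```r
# Shortened stand-in for the abstract (an assumption for brevity)
txt <- "folic acid (n = 40) or placebo (n = 40); amplitude (p = 0.001)"

# Split at every opening parenthesis
pieces <- strsplit(txt, "\\(")[[1]]

# Keep only what precedes the closing parenthesis in each piece
inside <- gsub("(.*)\\).*", "\\1", pieces)

# Keep the elements that start with "n ="
n_only <- grep("^n\\s*=", inside, value = TRUE)
n_only
```

Pieces without a closing parenthesis (such as the text before the first split) pass through gsub unchanged, which is why the grep filter is needed.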
