R - find name in string that matches a lookup field using regex

I have a data frame of ad listings for pets:

ID    Ad_title
1     1 year old ball python
2     Young red Blood python. - For Sale
3     1 Year Old Male Bearded Dragon - For Sale

I would like to take the common name in Ad_title (e.g. ball python) and create a new field with the Latin name for the species. To assist, I have another data frame that has the Latin names and common names:

ID    Latin_name           Common_name
1     Python regius        E: Ball Python, Royal Python G: Königspython
2     Python brongersmai   E: Red Blood Python, Malaysian Blood Python
3     Pogona barbata       E: Eastern Bearded Dragon, Bearded Dragon

How can I go about doing this? The tricky part is that the common names are buried in other text, both in the ad listing and in Common_name. If that were not the case I could just use %in%. If there is a way or function to do this with regex, I think that would be helpful.

The other answer does a good job outlining the general logic, so here are a few thoughts on a simple (though not optimized!!) way to do this:

First, you'll want to make a big two-column table of all the 'common names' (each name gets its own row), each alongside its Latin name. You could also make a dictionary here, but I like tables.

    reference_table <- data.frame(common = c("cat", "kitty", "dog"), technical = c("feline", "feline", "canine"))

  common technical
1    cat    feline
2  kitty    feline
3    dog    canine
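
For the lookup data in the question, a table of this shape could be built by splitting the packed Common_name field. This is only a sketch: it assumes the language prefixes always look like "E: " or "G: " and that names are separated by ", ". The names `lookup` and `parts` are made up for the example.

```r
# Lookup data from the question, re-typed by hand
lookup <- data.frame(
  Latin_name  = c("Python regius", "Python brongersmai", "Pogona barbata"),
  Common_name = c("E: Ball Python, Royal Python G: Königspython",
                  "E: Red Blood Python, Malaysian Blood Python",
                  "E: Eastern Bearded Dragon, Bearded Dragon"),
  stringsAsFactors = FALSE
)

# Split each packed field on the assumed "X: " language prefixes and on ", "
parts <- strsplit(lookup$Common_name, " ?[A-Z]: |, ")
# Drop the empty piece left by the leading prefix and trim stray whitespace
parts <- lapply(parts, function(p) trimws(p[nzchar(p)]))

# One row per common name, with its Latin name alongside
reference_table <- data.frame(
  common    = unlist(parts),
  technical = rep(lookup$Latin_name, lengths(parts)),
  stringsAsFactors = FALSE
)
```

Each Latin name is repeated once per common name found in its row, so the result is the same kind of long table as the cat/dog example.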

From here, just loop through every element of "ad_title" (use apply() or a for loop, depending on your preference). Now use something like this:

apply(reference_table, 1, function(X) {
  # apply() passes each row as a named character vector, so use X["common"], not X$common
  if (length(grep(X["common"], ad_title)) > 0) {  # the common name was found in ad_title
    # [code to replace the string]
  }
})

For inserting the new string, play with your regular regex tools. Alternatively, play with strsplit(ad_title, X["common"]). You'll be able to rebuild the ad_title using paste() and the pieces returned by strsplit().

Again, this is NOT the best way to do this, but hopefully the logic is simple.
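
Put together, the loop described in this answer might look like the following sketch. Since the goal in the question is a new field rather than a rewritten title, it fills a `Latin_name` column instead of replacing text. The data frame `ads`, the hand-typed `reference_table`, and the use of `ignore.case = TRUE` are assumptions made for illustration.

```r
# Ad listings from the question
ads <- data.frame(
  ID = 1:3,
  Ad_title = c("1 year old ball python",
               "Young red Blood python. - For Sale",
               "1 Year Old Male Bearded Dragon - For Sale"),
  stringsAsFactors = FALSE
)

# Hand-typed reference table: one common name per row (illustrative subset)
reference_table <- data.frame(
  common    = c("Ball Python", "Royal Python", "Red Blood Python", "Bearded Dragon"),
  technical = c("Python regius", "Python regius", "Python brongersmai", "Pogona barbata"),
  stringsAsFactors = FALSE
)

ads$Latin_name <- NA_character_
for (i in seq_len(nrow(reference_table))) {
  # Case-insensitive search for this common name in every ad title
  hit <- grepl(reference_table$common[i], ads$Ad_title, ignore.case = TRUE)
  # Only fill ads that were not already matched by an earlier row
  ads$Latin_name[hit & is.na(ads$Latin_name)] <- reference_table$technical[i]
}
```

Note that grepl() treats each common name as a regular expression here, so a name containing characters such as "(" or "." would need escaping first.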

Well, I tried to create a workable solution for your requirement. There could be better ways to do it, though, probably using packages such as data.table and/or stringr. Anyway, this snippet could be a working starting point. Oh, and I modified the Ad_title data a bit so that the species names are in title case.

# Re-create data
Ad_title <- c("1 year old Ball Python", "Young Red Blood Python. - For Sale",
              "1 Year Old Male Bearded Dragon - For Sale")
df2 <- data.frame(Latin_name = c("Python regius", "Python brongersmai", "Pogona barbata"),
                  Common_name = c("E: Ball Python, Royal Python G: Königspython",
                                  "E: Red Blood Python, Malaysian Blood Python",
                                  "E: Eastern Bearded Dragon, Bearded Dragon"),
                  stringsAsFactors = FALSE)

# Aggregate common names
Common_name <- paste(df2$Common_name, collapse = ", ")
Common_name <- unlist(strsplit(Common_name, "(E: )|( G: )|(, )"))
Common_name <- Common_name[Common_name != ""]

# Data frame latin names vs common names
df3 <- data.frame(Common_name, Latin_name = sapply(Common_name, grep, df2$Common_name),
                  row.names = NULL, stringsAsFactors = FALSE)
df3$Latin_name <- df2$Latin_name[df3$Latin_name]

# Data frame Ad vs common names
Ad_Common_name <- unlist(sapply(Common_name, grep, Ad_title))
df4 <- data.frame(Ad_title, Common_name = sapply(seq_along(Ad_title), function(i) names(Ad_Common_name[Ad_Common_name==i])),
                  stringsAsFactors = FALSE)
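
The snippet stops once each ad has a common name. One way to finish the job is to look up each ad's common name in df3. The sketch below re-types df3 (a subset) and df4 by hand, as the code above is expected to produce them. That is an assumption, and in practice you would use the real objects.

```r
# df3 (subset) and df4 as the snippet above is expected to produce them
df3 <- data.frame(
  Common_name = c("Ball Python", "Royal Python", "Red Blood Python",
                  "Malaysian Blood Python", "Eastern Bearded Dragon", "Bearded Dragon"),
  Latin_name  = c("Python regius", "Python regius", "Python brongersmai",
                  "Python brongersmai", "Pogona barbata", "Pogona barbata"),
  stringsAsFactors = FALSE
)
df4 <- data.frame(
  Ad_title    = c("1 year old Ball Python", "Young Red Blood Python. - For Sale",
                  "1 Year Old Male Bearded Dragon - For Sale"),
  Common_name = c("Ball Python", "Red Blood Python", "Bearded Dragon"),
  stringsAsFactors = FALSE
)

# match() returns, for each ad's common name, its row position in df3
df4$Latin_name <- df3$Latin_name[match(df4$Common_name, df3$Common_name)]
```

merge(df4, df3, by = "Common_name") would give the same pairing, but it sorts the rows by the key, whereas match() keeps the ads in their original order.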

Obviously you need a loop over your whole common-name lookup table, and another loop that splits the compound field on commas, before doing a simple regex match. There's no sane regex that will do it all. In future, avoid packed/compound structures that require packing and unpacking. They look fine for human consumption, but semantically, and for a program, you have multiple data values packed into a single field: what you have there is not a "common name" but "common names" delimited by commas.

Sorry if I haven't provided an R-specific answer. I'm a technology veteran and use many languages and technologies depending on the problem and the available resources. You will need to iterate over every record of your Latin names lookup table, and within each record iterate over the comma-delimited packed field of "common names", so that you're working with one common name at a time. With that single common name you search/replace, using regex or whatever means are available to you, over the whole input file. Plainly and simply, you need to start from that end, i.e. the lookup table, and loop through it. Iteration should be familiar to you, as it's a basic building block of any program or script. This kind of procedural logic is not part of the capability (or desired functionality) of regex itself. I assume you know how to write an iterative construct in R or whatever you're using for this.
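
In R, the nested iteration described here could be sketched roughly as follows. The variable names are invented for the example, the German name from the question's lookup is left out to keep the unpacking simple, and matching is done literally on lower-cased text, which is an assumption about what counts as a match.

```r
ad_titles <- c("1 year old ball python",
               "Young red Blood python. - For Sale",
               "1 Year Old Male Bearded Dragon - For Sale")
latin_names  <- c("Python regius", "Python brongersmai", "Pogona barbata")
common_names <- c("E: Ball Python, Royal Python",
                  "E: Red Blood Python, Malaysian Blood Python",
                  "E: Eastern Bearded Dragon, Bearded Dragon")

latin_for_ad <- rep(NA_character_, length(ad_titles))

# Outer loop: one record of the lookup table at a time
for (i in seq_along(latin_names)) {
  # Unpack the compound field: drop the assumed "E: " prefix, then split on ", "
  names_i <- strsplit(sub("^E: ", "", common_names[i]), ", ", fixed = TRUE)[[1]]
  # Inner loop: one common name at a time, searched across all ad titles
  for (nm in names_i) {
    hit <- grepl(tolower(nm), tolower(ad_titles), fixed = TRUE)
    latin_for_ad[hit] <- latin_names[i]
  }
}
```

Lower-casing both sides and using fixed = TRUE gives a literal, case-insensitive comparison, so nothing in a common name is interpreted as a regex metacharacter.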
