How to fix this regex in order to catch specific characters of a string?

I have a very_largeString that contains a list of words and their ids. I would like to extract every word, together with its full id, wherever an id starting with NC is immediately followed, on the next line, by an id starting with AQ. For example:

very_largeString= ''' Hola hola I 1
compis compis NCMS000 0.500006
! ! Fat 1

esta este DD0FS0 0.986779
y y CC 0.999962
es ser VSIP3S0 1
que que CS 0.437483
es ser VSIP3S0 1
muy muy RG 1
sencilla sencillo AQ0FS0 1
de de SPS00 0.999984
utilizar utilizar VMN0000 1
, , Fc 1
que que CS 0.437483
si si CS 0.99954
nos nos PP1CP000 0.935743
ponen poner VMIP3P0 1
facilidad facilidad NCFS000 1
con con SPS00 1
las el DA0FP0 0.970954
tareas tarea NCFP000 1
de de SPS00 0.999984
la el DA0FS0 0.972269
casa casa NCFS000 0.979058
pues pues CS 0.998047
mejor mejor AQ0CS0 0.873665
que que PR0CN000 0.562517
mejor mejor AQ0CS0 0.873665
, , Fc 1
pero pero CC 0.999764
tan tan RG 1
antigua antiguo AQ0FS0 0.953488
que que CS 0.437483
según según SPS00 0.995943
mi mi DP1CSS 0.999101
madre madre NCFS000 1
era ser VSII1S0 0.491262
de de SPS00 0.999984
carga carga NCFS000 0.952569
superior superior AQ0CS0 0.992424
'''

This would be the desired output, since these ids begin with the characters NC and AQ:

[('carga', 'NCFS000', 'superior', 'AQ0CS0'), ('carga', 'NCFS000', 'frontal', 'AQ0CS0')]

How can I fix my regex in order to extract all the words whose ids start with NC and AQ? This is what I have already tried:

regex_ = re.findall(r'^(\w+)\s\w+\s(NCFS000)\s[0-9.]+\n^(\w+)\s\w+\s(AQ0CS0)', very_largeString, re.M)

print regex_

The output is just the word and its associated id, for example:

 [('word','id'),('word','id')]

from pprint import pprint
import re

# Flags are passed as an argument: an inline (?mx) that is not at the very
# start of the pattern raises an error in Python 3.11 and later.
result = re.findall(r'''
    ^                  # Align to beginning of a line
    (\S+)\s+           # Grab first word
    \S+\s+             # Don't care about 2nd word
    (NC\S+)\s+         # 3rd word must start with NC
    \S+\n              # End of first line
    ^                  # Next line is identical in form
    (\S+)\s+           # to the first line
    \S+\s+
    (AQ\S+)\s+         # except 3rd word must start with AQ
    \S+\n
''', very_largeString, re.MULTILINE | re.VERBOSE)
pprint(result)
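
If the regex gets hard to maintain, the same pairing can probably be done without one. What follows is only a minimal sketch, assuming every data line has exactly four whitespace-separated fields (word, lemma, id, probability). The function name `noun_adjective_pairs` and the shortened `sample` string are made up for illustration and are not part of the question. It splits the text into lines and compares each line with the one directly after it:

```python
def noun_adjective_pairs(text):
    """Return (word, NC id, word, AQ id) for each NC line directly followed by an AQ line."""
    # One list of fields per physical line; blank lines become empty lists.
    rows = [line.split() for line in text.splitlines()]
    pairs = []
    # Walk over each line together with the line that follows it.
    for first, second in zip(rows, rows[1:]):
        # Skip anything that is not a well-formed four-field line.
        if len(first) == 4 and len(second) == 4:
            if first[2].startswith("NC") and second[2].startswith("AQ"):
                pairs.append((first[0], first[2], second[0], second[2]))
    return pairs


# Hypothetical short excerpt in the same format as very_largeString.
sample = """carga carga NCFS000 0.952569
superior superior AQ0CS0 0.992424
casa casa NCFS000 0.979058
pues pues CS 0.998047
mejor mejor AQ0CS0 0.873665"""

print(noun_adjective_pairs(sample))
# [('carga', 'NCFS000', 'superior', 'AQ0CS0')]
```

Unlike the pattern above, this version does not need a trailing newline after the last line, so a pair at the very end of the string is still found.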

My guess is you're trying to do some NLP (Natural Language Processing), and you want to extract from a Spanish corpus the pairs composed of a noun and a qualifier. There are already tools for such tasks.

I recommend you take a look at the Python Natural Language Toolkit (NLTK).

Also, I have to say it is not common to perform these operations on a corpus rather than on completely natural text. I think you should explain your intentions; perhaps the solution you're trying to achieve is not the best one for your actual problem.

Help us to help you.
