How to find capital letter words from an NLTK corpus using regex?

I'd like to use a regular expression to build a list of words that consist entirely of capital letters. The data set is a collection of biological thesis text files, called corpus.

The result of len(corpus.fileids()) is 487, which means that there are 487 theses in the corpus.

The main reason for this is to collect a word list for filtering biological terms such as gene names (ATP, BRCA, etc.).

Here is some code that I've been trying (P.S. I'm using Python 3).

I'm stuck on writing a function that goes through all the files in the corpus. For a single file, I think this would work:

capital = re.findall(r'[A-Z]+', GNICorpus)

But the thing is that I have to go through all the words in the thesis text files in the corpus, and I have no idea how.

1st trial

import re
import nltk
from nltk.corpus import*
x = [
    (file)
    for file in Corpus.fileids() 
    for w in Corpus.words(file) 
    if w.upper()
]

2nd trial

   capital = re.findall(r'[A-Z]+', Corpus)
   capital

3rd trial

for fileid in Corpus.fileids():
    words = Corpus.words(fileid)
    capital = re.findall(r'[A-Z]+', words)

Your regex matches any run of one or more capital letters, even in the middle of a word.

For example:

import re

Corpus = "These are SOME words and someTHAT shouldNot match"
result = re.findall(r'[A-Z]+', Corpus)
# result: ['T', 'SOME', 'THAT', 'N']

You'd be better off using this regex (where \b is a word boundary):

# \b[A-Z]+\b

Corpus = "These are SOME words and someTHAT shouldNot match"
result = re.findall(r'\b[A-Z]+\b', Corpus)
# result: ['SOME']

But this all depends on what you are looking for.
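Since the stated goal is to collect gene-like names, it may help to compare a few pattern variants on a small sample. The sketch below is only an illustration: the sample sentence and the two extra patterns (a minimum length of two, and digits allowed after the first letter) are assumptions, not something taken from the question.

```python
import re

# Hypothetical sample sentence, made up for illustration.
sample = "ATP and BRCA1 are discussed in A study of someTHAT data"

# Whole words made only of capital letters (the pattern suggested above).
all_caps = re.findall(r'\b[A-Z]+\b', sample)

# Same, but require at least two letters to skip single-letter words such as "A".
at_least_two = re.findall(r'\b[A-Z]{2,}\b', sample)

# Assumption: names may contain digits after the first capital letter (e.g. "BRCA1").
caps_and_digits = re.findall(r'\b[A-Z][A-Z0-9]+\b', sample)

print(all_caps)         # ['ATP', 'A']
print(at_least_two)     # ['ATP']
print(caps_and_digits)  # ['ATP', 'BRCA1']
```

Note that the plain \b[A-Z]+\b pattern skips "BRCA1" entirely, because there is no word boundary between the letters and the digit.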

You might not need a regex for this purpose, but in your use case it's faster =)

Assuming that the input is a text, this works:

import re 

text = "These are SOME words and someTHAT shouldNot match"
result = re.findall(r'\b[A-Z]+\b', text)

Using str.isupper (https://docs.python.org/3/library/stdtypes.html#str.isupper), this works too:

text = "These are SOME words and someTHAT shouldNot match"
result = [word for word in text.split() if word.isupper()]
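The two approaches are not quite equivalent once punctuation is involved, because str.split() only breaks on whitespace. A minimal sketch of the difference, using a made-up sample sentence:

```python
import re

# Hypothetical sample with punctuation attached to the capitalised tokens.
text = "Levels of ATP, BRCA and (DNA) were measured."

# split() only breaks on whitespace, so punctuation stays attached to the token,
# and isupper() ignores non-letter characters when deciding.
by_split = [word for word in text.split() if word.isupper()]

# The regex only captures the letters between word boundaries.
by_regex = re.findall(r'\b[A-Z]+\b', text)

print(by_split)  # ['ATP,', 'BRCA', '(DNA)']
print(by_regex)  # ['ATP', 'BRCA', 'DNA']
```

If the str.isupper route is preferred, the tokens would probably need their punctuation stripped (or a proper tokenizer) before being added to a word list.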

Assuming that the GNICorpus is the one from https://www.ncbi.nlm.nih.gov/pubmed/30309207, most probably available at https://github.com/Ewha-Bio/Genomics-Informatics-Corpus:

from nltk.corpus import PlaintextCorpusReader

root_dir = 'Genomics-Informatics-Corpus/GNI Corpus 1.0'
GNICorpus = PlaintextCorpusReader(root_dir, r'.*\.txt', encoding='utf-8')

The GNICorpus object has a .raw() function that concatenates all files in the object and returns a single str.

>>> type(GNICorpus.raw())
str

In that case, the regex can be applied to the raw string, e.g.

re.findall(r'\b[A-Z]+\b', GNICorpus.raw())
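If the results are needed per thesis rather than for the whole corpus at once, one possible sketch is to loop over the file ids and apply the same regex to each file's raw text. The helper name uppercase_words_by_file is hypothetical; it only assumes that the object passed in exposes .fileids() and .raw(fileid), as NLTK's PlaintextCorpusReader does.

```python
import re
from collections import Counter

# Whole words made only of capital letters.
CAPS_WORD = re.compile(r'\b[A-Z]+\b')

def uppercase_words_by_file(corpus):
    """Map each file id to a Counter of its all-capital words.

    `corpus` is assumed to expose .fileids() and .raw(fileid),
    as NLTK's PlaintextCorpusReader does.
    """
    return {
        fileid: Counter(CAPS_WORD.findall(corpus.raw(fileid)))
        for fileid in corpus.fileids()
    }
```

With the reader defined above, this could be called as uppercase_words_by_file(GNICorpus), giving one Counter per text file.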

And to use the str.isupper function instead of the regex, it's possible to loop through every token in the corpus object with the .words() function, i.e.

[word for word in GNICorpus.words() if word.isupper()]

You'll find that the regex is much faster than iterating through .words(). There's quite a lot of discussion online about regex vs. Python's native string methods, if you're interested.
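To check the speed difference on your own data, one option is the standard timeit module. The sketch below uses a synthetic string and str.split() as a stand-in for the corpus, so it does not include the cost of NLTK's tokenization in .words(); the numbers it prints are only indicative and will differ on the real corpus.

```python
import re
import timeit

# Synthetic text, made up for this sketch; results on a real corpus will differ.
text = "These are SOME words and someTHAT shouldNot match ATP BRCA " * 2000

def with_regex():
    # One regex pass over the whole string.
    return re.findall(r'\b[A-Z]+\b', text)

def with_isupper():
    # Split on whitespace, then test every token.
    return [word for word in text.split() if word.isupper()]

# Run each approach 20 times and report the total elapsed seconds.
regex_seconds = timeit.timeit(with_regex, number=20)
isupper_seconds = timeit.timeit(with_isupper, number=20)

print(f"regex:   {regex_seconds:.4f} s")
print(f"isupper: {isupper_seconds:.4f} s")
```

To time the real thing, the two functions could be swapped for calls on GNICorpus.raw() and GNICorpus.words() respectively.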

But wait, what if we could find gene sequences with this approach?

Instead of [A-Z], we can specify the [ATCG] character set:

from collections import Counter
Counter(re.findall(r'\b[ATCG]+\b', GNICorpus.raw()))

[out]:

Counter({'G': 1249,
         'CA': 958,
         'A': 6558,
         'CCCTC': 21,
         'C': 2981,
         'T': 1284,
         'CTCT': 3,
         'AG': 64,
         'AACC': 3,
         'AA': 28,
         'CC': 131,
         'TCGA': 122,
         'GT': 427,
         'GAGGGAGGGAGCGAGA': 3,
         'GC': 122,
         'GA': 102,
         'TGACGTCA': 3,
         'TCA': 15,
         'GCG': 4,
         'AGG': 12,
         'ACA': 3,
         'GCA': 3,
         'GTG': 3,
         'AGT': 6,
         'GAA': 3,
         'GAC': 18,
         'AGA': 6,
         'ACC': 7,
         'CTT': 11,
         'TGC': 12,
         'AGC': 3,
         'TCC': 7,
         'TTC': 6,
         'GTT': 4,
         'ACG': 12,
         'ATG': 4,
         'GAG': 9,
         'GGG': 3,
         'AAT': 3,
         'ACAGC': 3,
         'AT': 35,
         'TG': 270,
         'GGTCAACAAATCATAAAGATATTGG': 3,
         'TAAACTTCAGGGTGACCAAAAAATCA': 3,
         'TA': 17,
         'CT': 61,
         'CGC': 20,
         'TC': 100,
         'GG': 23,
         'CG': 15,
         'AC': 6,
         'CCCTCT': 4,
         'TT': 124,
         'CAGT': 3,
         'TCTG': 3,
         'ATCC': 61,
         'AAAAACAACAAGATAA': 3,
         'GATA': 6,
         'CACCC': 3,
         'ATC': 275,
         'GGCGCCATCTT': 3,
         'TCTGAGCC': 3,
         'CGCC': 3,
         'GCTA': 7,
         'AAA': 2,
         'AAG': 30,
         'GTA': 4,
         'ATT': 4,
         'AAC': 4,
         'CAT': 8,
         'GGC': 6,
         'TAA': 6,
         'TTT': 9,
         'CAG': 3,
         'TGG': 3,
         'CCT': 11,
         'CTC': 273,
         'CCG': 12,
         'GCT': 271,
         'TAG': 3,
         'TAT': 7,
         'CAC': 6,
         'TAC': 8,
         'TCG': 3,
         'TTG': 6,
         'ATA': 21,
         'TTAGGG': 3,
         'CACTA': 3,
         'TATA': 19,
         'CCA': 31,
         'CCC': 4,
         'CGT': 3,
         'CGA': 3,
         'CGG': 3,
         'GTC': 268,
         'GGCAGG': 246,
         'CGTGCCCCAGCCCAGTC': 1,
         'TTCCAGTACAGCCCATCCAATAAG': 1,
         'TGCGAGGGCTGCGAGGTC': 1,
         'TGTCAGCTTGCGTGTGGTTGC': 1,
         'GTAACCCGTTGCACCCCATT': 1,
         'CCATCCAATCGGTAGTAGCG': 1,
         'GACGATGCTCCCCGGGCTGTATTC': 1,
         'TCTCTTGCTCTGGGCCTCGTCACC': 1,
         'TCTTAACTGCCGGATCCACAAAAA': 1,
         'ATCTCCGCCAACAGCTTCTCCTTC': 1,
         'GGGCAGCCTCCGTTTGATGGT': 1,
         'CGCTTGGCAGGGTGTTTGGTC': 1,
         'GCCATCGAGGAGTGCCAATACC': 1,
         'GGCCACACCTGCTGAAGAGATG': 1,
         'GTAGCCCCAGTGGAGAGCCTTGTG': 1,
         'ATGCCAGTGGGGAGTTTGTTATCG': 1,
         'TGAATCGGACCCACTTGAGAGG': 1,
         'CAGGAGCGGCTTGTTTGAGGTA': 1,
         'GGAGGCGCCGAGACTTAGGT': 1,
         'GCGGGTGAGCACAGCAGAGC': 1,
         'TCATCCCGAATAAAAGCGAAGAGC': 1,
         'AGGGCAACAACATTAGCAGGAGAT': 1,
         'GATGTGATCCGACATTACA': 1,
         'CTAGAACTGCTCTGTATGT': 1,
         'CAATTCGGCAAGTAATGGA': 1,
         'GTCTCTTCGGGAACTGCAAG': 1,
         'TGGGACACAGGCACTGTAGA': 1,
         'GCTCTCTGCTCCTCCTGTTC': 1,
         'CAATACGACCAAATCCGTTG': 1,
         'ATCG': 10,
         'TCGT': 2,
         'TGAT': 1,
         'CGTG': 1,
         'CGTT': 1,
         'CATC': 1,
         'GTGA': 1,
         'ATCGT': 4,
         'TCGTG': 1,
         'TCGTT': 1,
         'CGTGA': 1,
         'CATCG': 1,
         'GTGAT': 1,
         'CGTGAT': 1,
         'CATCGT': 1,
         'TCGTGA': 1,
         'TCGTGAT': 1,
         'ATCGTGACT': 1,
         'CGTGATT': 2,
         'GTGACT': 1,
         'ATCGTT': 1,
         'ATCGTGAGA': 1,
         'GTGAAG': 1,
         'GTGATTG': 1,
         'GTGATT': 1,
         'TCGTGACT': 3,
         'TCGATTG': 3,
         'TCGTGAGA': 3,
         'TTACT': 3,
         'ACT': 5,
         'ATTG': 2,
         'GATTG': 1,
         'TGTGTAGAGCTCCTCG': 1,
         'TTAAA': 1,
         'GGCG': 1,
         'TACCTGCATGCTGCGGTGAAG': 1,
         'AGGGCTGTGTAGAAGTACTCGC': 1,
         'TTTT': 2,
         'AATAAA': 1,
         'TCGTGCA': 1,
         'TCTACCTCGACAG': 1,
         'CCTCCTCCT': 1,
         'CCTTGGTTTTC': 1,
         'GAAATCCCATCACCATCTTCCAGG': 1,
         'GAGCCCCAGCCTTCTCCATG': 1,
         'AACACCA': 1,
         'CGCTCCCGCCTTACTTCGCA': 1,
         'TTAGCTTGCCTCGTCCCC': 1,
         'TTTCGACACTGGATGGCG': 1,
         'TTGCGTTGCGTAGGGGGGAT': 1,
         'TTTAAA': 2,
         'GATATC': 1,
         'AGTATC': 1,
         'CGTCTGTGAGGGGAGCGTTT': 1,
         'TGATTTTGATGACGAGCGTAAT': 1,
         'GATGTGAGAACTGTATCCTAGCAAG': 1,
         'GGCTGGCCTGTTGAACAAGTCTGGA': 1,
         'ATAC': 1,
         'GTCGGAGTCAACGGATTTG': 1,
         'TGGGTGGAATCA': 1,
         'TATTGGA': 1,
         'AGAAAAAGCAACCACGAAGC': 1,
         'AAACCTCTGTCTGTGAGTGCC': 1,
         'TATT': 1,
         'ACCC': 1,
         'GCCA': 15,
         'CAAT': 1,
         'AGAC': 11,
         'GCTCCCGCCTTACTTCGCAT': 1,
         'CGGGGACGAGGCAAGCTAA': 1,
         'GCCGCCATCCAGTGTCG': 1,
         'TTGCGTTGCGTAGGGGGG': 1})

What if I want to set a minimum sequence length?

If we want to set the minimum number of characters to 4, we can use {4,} instead of +:

from collections import Counter
Counter(re.findall(r'\b[ATCG]{4,}\b', GNICorpus.raw()))
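As a small self-contained illustration of the {4,} quantifier, the sketch below wraps the pattern in a helper with a configurable minimum length. The function name count_base_runs and the sample sentence are hypothetical and only serve the example.

```python
import re
from collections import Counter

def count_base_runs(text, min_len=4):
    """Count whole words of at least `min_len` characters drawn from A, T, C, G."""
    pattern = re.compile(r'\b[ATCG]{%d,}\b' % min_len)
    return Counter(pattern.findall(text))

# Hypothetical sample text, made up for illustration.
sample = "The TATA box, the TTAGGG repeat and the CA dinucleotide; TATA again. GATTACA"
counts = count_base_runs(sample)

# "CA" is dropped because it is shorter than four characters.
print(counts.most_common())  # [('TATA', 2), ('TTAGGG', 1), ('GATTACA', 1)]
```

Keep in mind that any all-capital word spelled only with these four letters (for example "CAT" or "TAG" with a lower minimum) also matches, so the counts are candidates rather than confirmed sequences.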
