Extract Text From Unstructured Medical Documents For NLP

I have a lot of unstructured medical documents in all sorts of different formats. What's the best way to parse out all the good sentences to use for NLP?

Currently I'm using SpaCy to do this, but even with multiprocessing it is pretty slow, and the default sentence parser doesn't work 100% of the time. Here is an example of how I try to get good sentences with SpaCy:

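import spacy


def get_good_sents(texts, batch_size, n_process):
    # Only POS tags and sentence boundaries are used below, so unrelated
    # components such as NER are disabled.
    nlp = spacy.load("en_core_web_sm", disable=[
        'ner',
        'entity_linker',
        'textcat',
        'entity_ruler',
        'sentencizer',
        'merge_noun_chunks',
        'merge_entities',
        'merge_subtokens',
    ])
    pipe = nlp.pipe(texts, batch_size=batch_size, n_process=n_process)

    rows = []
    for doc in pipe:
        clean_text = []
        for sent in doc.sents:
            # Set of coarse POS tags that occur in this sentence.
            struct = {token.pos_ for token in sent}
            # Keep the sentence only if it has a subject-like token
            # and an action-like token.
            subject = any(x in struct for x in ['NOUN', 'PRON'])
            action = any(x in struct for x in ['VERB', 'ADJ', 'AUX'])

            if subject and action:
                clean_text.append(sent.text)
        # One output row per input document, with line breaks flattened.
        rows.append(' '.join(clean_text).replace('\n', ' ').replace('\r', ''))

    return rows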

Example of some text extracts:

Raw Text:

TITLE
Patient Name:
Has a heart Condition.
Is 70 Years old.

Expected Output:

Has a heart Condition.
Is 70 Years old.

This example isn't great, because I have tons of different documents in all sorts of formats, and they can vary a lot. It basically boils down to wanting to strip out the boilerplate and keep only the actual free text.

Based on the comments in the discussion above, I am very confident that spaCy will not give you very good results, simply because it is closely tied to the expectation of a valid grammatical sentence.

At least with the current approach of looking for "correctly tagged words" in each line, I would expect this not to work very well, since tagging a sentence correctly already depends on a decent input format; it is once again time to quote one of my favorite concepts in Machine Learning.

Depending on the accuracy you want to achieve, I would personally adopt a defensive regex approach, where you manually sort out headings (lines with fewer than 4 words, lines that end in a colon/semicolon, etc.), although it will require significantly more effort.
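To make the regex idea concrete, here is a minimal sketch of such a filter. It only implements the two rules mentioned above (fewer than 4 words, or ending in a colon/semicolon). The names `is_boilerplate`, `extract_free_text` and `MIN_WORDS` are made up for illustration, and the threshold is an assumption you would need to tune on your own documents.

```python
import re
from typing import List

# Hypothetical threshold taken from the rule of thumb above; tune as needed.
MIN_WORDS = 4
# A line whose last non-space character is a colon or semicolon.
HEADING_END = re.compile(r"[:;]\s*$")


def is_boilerplate(line: str) -> bool:
    """Return True when a line looks like a heading or a form label."""
    stripped = line.strip()
    if not stripped:
        return True  # blank lines carry no free text
    if HEADING_END.search(stripped):
        return True  # e.g. "Patient Name:"
    if len(stripped.split()) < MIN_WORDS:
        return True  # e.g. "TITLE"
    return False


def extract_free_text(text: str) -> List[str]:
    """Keep only the lines that do not look like boilerplate."""
    return [ln.strip() for ln in text.splitlines() if not is_boilerplate(ln)]


if __name__ == "__main__":
    raw = "TITLE\nPatient Name:\nHas a heart Condition.\nIs 70 Years old.\n"
    for line in extract_free_text(raw):
        print(line)
```

On the sample from the question this keeps "Has a heart Condition." and "Is 70 Years old." and drops the other two lines. Short but meaningful lines such as "No known allergies." would also be dropped by the word-count rule, which is part of the extra effort: you will likely end up adding exceptions per document type.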

Another, more direct solution would be to borrow what other common boilerplate-removal tools are doing, although most of those are designed to remove boilerplate from HTML content and thus have an easier time, since they can use tag information as well.
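One signal that some HTML boilerplate removers rely on (jusText is one example, not necessarily the tools meant above) is stopword density: running prose contains many function words, while headings and labels contain few. Below is a rough sketch of that idea applied to plain-text lines. The stopword list, the function names and the `min_density` default are all assumptions chosen for illustration, not values taken from any library.

```python
import re
from typing import List

# Tiny illustrative stopword list (an assumption); a real one would be far longer.
STOPWORDS = frozenset({
    "a", "an", "the", "is", "are", "was", "were", "has", "have", "had",
    "of", "in", "on", "to", "and", "or", "with", "for", "he", "she", "it",
})

# Alphabetic words only; numbers and punctuation are ignored.
WORD = re.compile(r"[A-Za-z']+")


def stopword_density(line: str) -> float:
    """Fraction of the alphabetic words in a line that are stopwords."""
    words = WORD.findall(line.lower())
    if not words:
        return 0.0
    return sum(w in STOPWORDS for w in words) / len(words)


def keep_dense_lines(text: str, min_density: float = 0.2) -> List[str]:
    """Keep lines whose stopword density reaches the (hypothetical) threshold."""
    return [
        ln.strip()
        for ln in text.splitlines()
        if stopword_density(ln) >= min_density
    ]


if __name__ == "__main__":
    raw = "TITLE\nPatient Name:\nHas a heart Condition.\nIs 70 Years old.\n"
    for line in keep_dense_lines(raw):
        print(line)
```

Unlike a part-of-speech check, this needs no tagger and so does not depend on the line being a well-formed sentence, but terse clinical notes with few function words will slip through the threshold.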
