Extract Text From Unstructured Medical Documents For NLP
I have a lot of unstructured medical documents in all sorts of different formats. What's the best way to parse out all the good sentences to use for NLP?
Currently I'm using spaCy to do this, but even with multiprocessing it is pretty slow, and the default sentence parser doesn't work 100% of the time. Here is an example of how I try to get good sentences with spaCy:
import spacy

def get_good_sents(texts, batch_size, n_process):
    # Disable the pipeline components we don't need; the tagger and
    # dependency parser stay enabled so doc.sents and token.pos_ work.
    nlp = spacy.load("en_core_web_sm", disable=[
        'ner',
        'entity_linker',
        'textcat',
        'entity_ruler',
        'sentencizer',
        'merge_noun_chunks',
        'merge_entities',
        'merge_subtokens',
    ])
    pipe = nlp.pipe(texts, batch_size=batch_size, n_process=n_process)
    rows = []
    for doc in pipe:
        clean_text = []
        for sent in doc.sents:
            # Keep a sentence only if it has both a subject-like and an
            # action-like part of speech.
            struct = [token.pos_ for token in sent]
            subject = any(x in struct for x in ['NOUN', 'PRON'])
            action = any(x in struct for x in ['VERB', 'ADJ', 'AUX'])
            if subject and action:
                clean_text.append(sent.text)
        rows.append(' '.join(clean_text).replace('\n', ' ').replace('\r', ''))
    return rows
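The keep/drop rule itself can be exercised without loading a model. Here is a minimal sketch of the same subject/action filter, with POS tags hard-coded for illustration (the helper name `is_good_sent` is mine, not spaCy's):

```python
def is_good_sent(pos_tags):
    """Same rule as in get_good_sents: keep a sentence only if it contains
    both a subject-like tag (NOUN/PRON) and an action-like tag (VERB/ADJ/AUX)."""
    subject = any(t in ('NOUN', 'PRON') for t in pos_tags)
    action = any(t in ('VERB', 'ADJ', 'AUX') for t in pos_tags)
    return subject and action

# "Has a heart Condition." -> roughly VERB DET NOUN NOUN PUNCT -> kept
print(is_good_sent(['VERB', 'DET', 'NOUN', 'NOUN', 'PUNCT']))  # True
# "TITLE" -> PROPN only, no subject or action tag -> dropped
print(is_good_sent(['PROPN']))  # False
```

Note that a heading like "TITLE" is usually tagged PROPN, not NOUN, which is why this filter happens to drop it.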
Example of some text extracts
Raw Text:
TITLE
Patient Name:
Has a heart Condition.
Is 70 Years old.
Expected Output:
Has a heart Condition.
Is 70 Years old.
This example's not great because I have tons of different documents in all sorts of formats, and they can really vary a lot. It basically boils down to me wanting to strip out the boilerplate stuff and keep just the actual free text.
Based on the comments in the discussion above, I am fairly confident that spaCy will not give you very good results here, simply because it is tightly bound to the expectation of grammatically valid sentences. At least with the current approach of looking for "correctly tagged words" in each line, I would expect this not to work very well, since tagging a sentence correctly already depends on a decent input format; it is once again time to quote one of my favorite concepts in Machine Learning.
Depending on the accuracy you want to achieve, I would personally adopt a defensive regex/heuristic approach, where you manually filter out headings (lines with fewer than 4 words, lines that end in a colon or semicolon, etc.), although it will require significantly more effort.
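As a rough sketch of that defensive approach, the heuristics mentioned above (short lines, lines ending in a colon or semicolon, plus an all-caps check) might look like this; the thresholds and function names are illustrative, not a tuned solution:

```python
def looks_like_heading(line: str) -> bool:
    """Heuristic check for boilerplate/heading lines; thresholds are illustrative."""
    stripped = line.strip()
    if not stripped:
        return True                       # blank line
    if len(stripped.split()) < 4:
        return True                       # very short lines are usually labels
    if stripped.endswith((':', ';')):
        return True                       # "Patient Name:" style field labels
    if stripped.isupper():
        return True                       # ALL-CAPS lines like "TITLE"
    return False

def strip_boilerplate(text: str) -> str:
    """Drop heading-like lines and join the rest into free text."""
    kept = [line.strip() for line in text.splitlines()
            if not looks_like_heading(line)]
    return ' '.join(kept)

raw = "TITLE\nPatient Name:\nHas a heart Condition.\nIs 70 Years old."
print(strip_boilerplate(raw))  # Has a heart Condition. Is 70 Years old.
```

You would almost certainly need to extend the rule set per document family, but rules like these are cheap to run and easy to debug compared to a full NLP pipeline.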
Another, more direct solution would be to look at what other common boilerplate-removal tools are doing, although most of those target boilerplate in HTML content and thus have an easier time because they can exploit tag information as well.