
XML Parsing in Databricks Spark Scala AWS using MAVEN - HL7 V3 files from DailyMed

Human prescription label files are extracted from DailyMed (Download All Drug Labels). These .xml files use the HL7 V3 format, which has proven difficult to parse (see the installation instructions for Maven XML parsing on an AWS cluster in Databricks), even though the correct library is installed on my cluster. Does anyone have tips or examples for correctly parsing these file types from .xml into a Spark dataframe?

My current approach involves retrieving all the files and storing them in DBFS.

%scala
import java.net.URL
import java.io.File
import org.apache.commons.io.FileUtils

FileUtils.copyURLToFile(new URL("https://dailymed-data.nlm.nih.gov/public-release-files/dm_spl_release_human_rx_part1.zip"), new File("/dbfs/FileStore/your_path_here/dm_spl_release_human_rx_part1.zip"))

Unzip the downloaded file:

%sh
unzip -vu '/dbfs/FileStore/your_path_here/dm_spl_release_human_rx_part1.zip'  -d /dbfs/FileStore/your_path_here/

Unzip the zip files inside the extracted archive (initial pass):

%sh
for file in /dbfs/FileStore/your_path_here/prescription/*.zip
do
  # quote "$file" so paths with spaces do not break the loop
  unzip -j "$file" '*.xml' -d /dbfs/FileStore/your_path_here/xml/
done
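The same step can also be done without shelling out, using Python's standard-library `zipfile` module. This is a minimal sketch, not part of the original workflow; the directory paths are placeholders, and `extract_xml_members` is a hypothetical helper name:

```python
import pathlib
import zipfile

def extract_xml_members(zip_dir: str, out_dir: str) -> list:
    """Extract only the .xml members of every zip in zip_dir into out_dir,
    flattening directory prefixes like `unzip -j` does."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    for zip_path in sorted(pathlib.Path(zip_dir).glob('*.zip')):
        with zipfile.ZipFile(zip_path) as zf:
            for member in zf.namelist():
                if member.endswith('.xml'):
                    # drop any directory prefix inside the archive
                    target = out / pathlib.Path(member).name
                    target.write_bytes(zf.read(member))
                    extracted.append(target.name)
    return extracted
```

Doing this in Python keeps the loop inside the notebook session, which makes it easier to log or skip corrupt archives than a `%sh` cell.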

Parsing becomes difficult from here because of the unusual structure of the HL7 V3 .xml format. I tried converting to .json but ran into problems with special characters. I am now resorting to removing the special characters and then parsing the .xml into a Spark dataframe. Any tips on how someone would do this in Spark Scala would be great!
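One way to approach the special-character cleanup is to strip only the control characters that XML 1.0 forbids, rather than all non-ASCII text (labels contain legitimate Unicode that should be kept). A minimal sketch; the character ranges below are an assumption based on the XML 1.0 rules, not something the original post specifies:

```python
import re

# XML 1.0 forbids most C0 control characters; strip those but keep
# tab (\x09), newline (\x0a), carriage return (\x0d), and normal text.
_ILLEGAL_XML_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def strip_illegal_xml_chars(text: str) -> str:
    """Remove characters that make XML parsers reject a document."""
    return _ILLEGAL_XML_CHARS.sub('', text)
```

Running each file's contents through a filter like this before parsing avoids the blanket removal of "special characters" that can silently delete meaningful label text.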

Here is an update with the read attempt and the resulting message.

import com.databricks.spark.xml.schema_of_xml
import spark.implicits._

val df = spark.read.format("xml").load("/FileStore/your_path_here/xml/ABD6ECF0-DC8E-41DE-89F2-1E36ED9D6535.xml")
// val payloadSchema = schema_of_xml(df.select("payload").as[String])
// val parsed = df.withColumn("parsed", from_xml($"payload", payloadSchema))

df.show()


A developer on my team ( https://github.com/gardnmi ) helped parse the .xml documents and ultimately load them into a dataframe. He did a great job! Posting it here in the hope that others can use it or contribute to it.

%python 
import pandas as pd
import numpy as np
from xml.dom import minidom
import pathlib
import os
import fnmatch
import lxml
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup
from collections import defaultdict

directory = '/dbfs/FileStore/your_path/your@domain.com/label/xml/'
files = pathlib.Path(f'{directory}').glob('*.xml')
rows = []
unscanned_files = []

for n, file in enumerate(files):
  if file.name == 'ABD6ECF0-DC8E-41DE-89F2-1E36ED9D6535_without_character_or_first_two_lines.xml':
    continue  # skip this file; `pass` would fall through and parse it anyway
  print(f'{n}: {file.name}')

  doc = minidom.parse(str(file))
  soup = BeautifulSoup(doc.toxml(), 'lxml')
  set_id = soup.find('setid')['root']
  text = defaultdict(list)
  
  indication_code = soup.find('code', attrs={'code': '34067-9'}) # Indication and Usage Heading
  unclassified_code = soup.find('code', attrs={'code': '42229-5'}) # Unclassified Heading
  
  # File may not contain Indication and Usage Heading
  if indication_code:
    for sibling in indication_code.nextSiblingGenerator():
      if sibling.name and sibling.text:
        if sibling.name != 'component':
          paragraphs = sibling.find_all('paragraph')
          if paragraphs:
            for paragraph in paragraphs:
              text['34067-9'].append(paragraph.text.strip('\n').replace("\n", ""))            
              # Some Text is contained within lists.  See file 002bf3fe-96c9-4969-b5f8-8818a98be6b2.xml
              for sibling in paragraph.nextSiblingGenerator():
                if sibling.name and sibling.text:
                  lists = sibling.find_all('item')
                  if lists:
                    for list_tag in lists:
                      text['34067-9'].append(list_tag.text.strip('\n').replace("\n", ""))
                         
        else:
          unclassified_code = sibling.find('code', attrs={'code': '42229-5'}) # Code 42229-5 is used for Structured Product Labeling Unclassified Section   
          if unclassified_code:
            for sibling in unclassified_code.nextSiblingGenerator():
              if sibling.name and sibling.text:
                paragraphs = sibling.find_all('paragraph')
                if paragraphs:
                  for paragraph in paragraphs:
                    text['42229-5'].append(paragraph.text.strip('\n').replace("\n", ""))                   
                    # Some Text is contained within lists.  See file 002bf3fe-96c9-4969-b5f8-8818a98be6b2.xml
                    for sibling in paragraph.nextSiblingGenerator():
                      if sibling.name and sibling.text:
                        lists = sibling.find_all('item')
                        if lists:
                          for list_tag in lists:
                            text['42229-5'].append(list_tag.text.strip('\n').replace("\n", ""))
                    
                    
  # Runs if no Indication and Usage Section Found.
  # Indications and Usage may be under the unclassified heading
  elif unclassified_code:
    for sibling in unclassified_code.nextSiblingGenerator():
      if sibling.name and sibling.text:
        paragraphs = sibling.find_all('paragraph')
        if paragraphs:
          for paragraph in paragraphs:
            text['42229-5'].append(paragraph.text.strip('\n').replace("\n", ""))        
            # Some Text is contained within lists.  See file 002bf3fe-96c9-4969-b5f8-8818a98be6b2.xml
            for sibling in paragraph.nextSiblingGenerator():
              if sibling.name and sibling.text:
                lists = sibling.find_all('item')
                if lists:
                  for list_tag in lists:
                    text['42229-5'].append(list_tag.text.strip('\n').replace("\n", ""))            

  
  # If no LOINC heading is found, record None
  else:
    text[None].append(None)
    unscanned_files.append(file.name)
    
  for k,v in text.items():
    for n,l in enumerate(v):
      rows.append((
        file.name, # xml file
        set_id, # drug id
        k,# code https://www.fda.gov/industry/structured-product-labeling-resources/section-headings-loinc 
        n+1,# number of text found 
        l # text
      ))
      
df = pd.DataFrame(rows, columns=['file_name', 'set_id', 'loinc', 'loinc_count_per_file', 'loinc_paragraph_text'])
sdf = spark.createDataFrame(df)
spark.sql("DROP TABLE IF EXISTS sandbox.humanPrescriptionLabel_xml")
sdf.write.mode('overwrite').saveAsTable('sandbox.humanPrescriptionLabel_xml')
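For anyone without BeautifulSoup available on the cluster, the core extraction (set ID plus the paragraphs under a coded section) can be sketched with only the standard library's `ElementTree`. The XML below is a simplified stand-in, not real DailyMed data, and real SPL files wrap everything in the `urn:hl7-org:v3` namespace, so lookups there would need namespace-qualified tags:

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free stand-in for an SPL document.
SAMPLE = """<document>
  <setId root="ABC-123"/>
  <section>
    <code code="34067-9"/>
    <text><paragraph>Indicated for example use.</paragraph></text>
  </section>
</document>"""

def indication_paragraphs(xml_text: str):
    """Return (set_id, [paragraph texts]) for the 34067-9 (Indications
    and Usage) section of an SPL-like document."""
    root = ET.fromstring(xml_text)
    set_id = root.find('.//setId').get('root')
    texts = []
    for section in root.iter('section'):
        code = section.find('code')
        if code is not None and code.get('code') == '34067-9':
            texts += [p.text.strip() for p in section.iter('paragraph') if p.text]
    return set_id, texts
```

The BeautifulSoup version above is more forgiving of malformed markup; this sketch trades that robustness for zero extra dependencies.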
