
Extracting data sets from unstructured data in R/Python

We are trying to extract travel itineraries from travel requests, which are filled in by standardized auditors.

(Image could not be uploaded.)

Example:

EY  275   13FEB HYDAUH 0425   0715  

Here the data would be read as follows:

EY > Travel type
275 > Flight number
13FEB > Date of travel
HYDAUH > Departure and arrival airports
0425   0715 > Boarding and landing times of the flight

We need to extract the individual data elements from the raw text fields, then map them to their respective travel fields and compute several values.

Are there procedures in R/Python to achieve this with minimal effort?

I am looking for existing functions/procedures for splitting and mapping the data.

If you can extract a single record, as shown in your second example, and if there is always at least one space between the fields, then pulling out the individual pieces of data is straightforward in Python:

>>> itin = 'EY  275   13FEB HYDAUH 0425   0715'
>>> ifields = itin.split()
>>> ifields[0] # travel type
'EY'
>>> ifields[1] # flight number
'275'
>>> ifields[2] # date of travel
'13FEB'
>>> ifields[3][0:3] # departure airport
'HYD'
>>> ifields[3][3:6] # destination airport
'AUH'
>>> ifields[4] # boarding time
'0425'
>>> ifields[5] # landing time
'0715'
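
If you want named fields rather than list positions, here is a minimal sketch that wraps the same split in a function. The names `parse_itinerary` and `parse_hhmm`, the dictionary keys, and the `year` argument are my own choices for illustration (the record itself carries no year), and it assumes every record has exactly six whitespace-separated fields:

```python
from datetime import date, time

# Three-letter month abbreviations mapped to month numbers (JAN -> 1, ...).
MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}


def parse_hhmm(text):
    """Turn a four-digit 'HHMM' string into a datetime.time."""
    return time(int(text[:2]), int(text[2:]))


def parse_itinerary(record, year):
    """Split one record into named, typed fields.

    Assumes exactly six whitespace-separated fields; the year is supplied
    by the caller because the record only holds day and month.
    """
    travel_type, flight, day_month, route, boarding, landing = record.split()
    return {
        "travel_type": travel_type,
        "flight_number": flight,
        "travel_date": date(year, MONTHS[day_month[2:].upper()], int(day_month[:2])),
        "departure_airport": route[:3],
        "destination_airport": route[3:6],
        "boarding_time": parse_hhmm(boarding),
        "landing_time": parse_hhmm(landing),
    }


fields = parse_itinerary('EY  275   13FEB HYDAUH 0425   0715', year=2024)
print(fields["travel_date"], fields["departure_airport"], fields["boarding_time"])
```

A record with a missing or extra field makes the unpacking raise `ValueError`, which is one way to flag rows that need a manual look.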

Your first example shows a second record following directly on from the first with no space. Is that correct? If so, is each record always the same number of characters in length?

>>> itinline = 'QR 529  09AUG MAADOH  0405  0600QR  67  09AUG DOHFRA  0815'
>>> itinline[0:32]
'QR 529  09AUG MAADOH  0405  0600'
>>> itinline[32:64]
'QR  67  09AUG DOHFRA  0815'
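
If the records really are fixed width, the slicing can be generalised to any number of records per line. This is only a sketch: `split_fixed_width` is a made-up helper name, and the width of 32 is simply what the first record in the example above happens to measure, so check it against your real data:

```python
def split_fixed_width(line, width=32):
    """Cut a line into consecutive chunks of `width` characters.

    The last chunk may be shorter if the line ends early.
    """
    return [line[start:start + width] for start in range(0, len(line), width)]


itinline = 'QR 529  09AUG MAADOH  0405  0600QR  67  09AUG DOHFRA  0815'
records = split_fixed_width(itinline)
for record in records:
    # Each chunk can then be split on whitespace as before.
    print(record.split())
```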

If your data has multiple records of variable length on a single line, or if there may or may not be spaces between each field, the parsing becomes more complex, but should still be fairly easy to do in Python. In this case please post a more complete example with several records and show the output you want to get.
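
In the meantime, one possible approach is a regular expression. This is only a sketch: the pattern is my guess at the layout (a two-letter code, a flight number of 1 to 4 digits, a DDMMM date, two three-letter airport codes, a four-digit boarding time and an optional landing time), and the group names and `find_records` are made up for illustration:

```python
import re

# Assumed record layout; spaces between fields are optional throughout.
RECORD = re.compile(
    r"(?P<travel_type>[A-Z]{2})\s*"
    r"(?P<flight>\d{1,4})\s*"
    r"(?P<date>\d{2}[A-Z]{3})\s*"
    r"(?P<origin>[A-Z]{3})(?P<destination>[A-Z]{3})\s*"
    r"(?P<boarding>\d{4})"
    r"(?:\s*(?P<landing>\d{4}))?"   # landing time may be missing
)


def find_records(line):
    """Return one dict per record found in the line, in order."""
    return [match.groupdict() for match in RECORD.finditer(line)]


line = 'QR 529  09AUG MAADOH  0405  0600QR  67  09AUG DOHFRA  0815'
found = find_records(line)
for item in found:
    print(item)

# The same pattern also copes with a record that has no spaces at all.
packed = find_records('EY27513FEBHYDAUH04250715')
```

A missing landing time comes back as `None`. Text that does not fit the pattern is skipped silently, so it is worth comparing the number of matches against the number of records you expect.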
