Extracting datasets from unstructured data in R / Python

Question

We are trying to extract travel itineraries from travel requests, which are filled in by standardized auditors.

(Unable to upload an image.)

Example:

EY  275   13FEB HYDAUH 0425   0715

Here the data would be interpreted as follows:

EY > Travel type
275 > Flight number
13FEB > Date of travel
HYDAUH > Departure and arrival airports
0425   0715 > Boarding and landing times of the flight

We need to extract the individual data elements from the raw text fields, then map them to their respective travel fields and compute several values from them.

Are there procedures in R/Python to achieve this with minimal effort?

I am looking for helper functions/procedures for the data splitting and mapping.

Answer 1

If you can extract a single record, as shown in your second example, and if there is always at least one space between the fields, then pulling out the individual pieces of data is straightforward in Python:

>>> itin = 'EY  275   13FEB HYDAUH 0425   0715'
>>> ifields = itin.split()
>>> ifields[0] # travel type
'EY'
>>> ifields[1] # flight number
'275'
>>> ifields[2] # date of travel
'13FEB'
>>> ifields[3][0:3] # departure airport
'HYD'
>>> ifields[3][3:6] # destination airport
'AUH'
>>> ifields[4] # boarding time
'0425'
>>> ifields[5] # landing time
'0715'
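
The same split can be wrapped in a small function that returns named fields. This is only a minimal sketch: it assumes every record has exactly six whitespace-separated fields and that the fourth field is two three-letter airport codes joined together. The function name `parse_itinerary` and the dictionary keys are made up for illustration.

```python
def parse_itinerary(record):
    """Split one itinerary record into named fields (hypothetical helper)."""
    fields = record.split()  # splits on any run of whitespace
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {record!r}")
    travel_type, flight_no, date, airports, boarding, landing = fields
    if len(airports) != 6:
        raise ValueError(f"expected two 3-letter airport codes: {airports!r}")
    return {
        "travel_type": travel_type,
        "flight_no": flight_no,
        "date": date,
        "origin": airports[0:3],       # first three letters
        "destination": airports[3:6],  # last three letters
        "boarding": boarding,
        "landing": landing,
    }

print(parse_itinerary('EY  275   13FEB HYDAUH 0425   0715'))
```

A list of such dictionaries can be passed to `pandas.DataFrame` if you want the result as a table.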

Your first example shows a second record following directly on from the first with no space. Is that correct? If so, is each record always the same number of characters in length? If it is, the records can be separated by slicing at fixed positions:

>>> itinline = 'QR 529  09AUG MAADOH  0405  0600QR  67  09AUG DOHFRA  0815'
>>> itinline[0:32]
'QR 529  09AUG MAADOH  0405  0600'
>>> itinline[32:64]
'QR  67  09AUG DOHFRA  0815'
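
If the fixed-width assumption holds, the slicing can be generalised to any number of records on a line. The sketch below assumes a record width of 32 characters, taken from the example above; `split_fixed_width` is a hypothetical helper name.

```python
def split_fixed_width(line, width=32):
    """Cut a line into consecutive chunks of `width` characters."""
    return [line[i:i + width] for i in range(0, len(line), width)]

itinline = 'QR 529  09AUG MAADOH  0405  0600QR  67  09AUG DOHFRA  0815'
for record in split_fixed_width(itinline):
    # Each chunk can then be split on whitespace as before.
    print(record.split())
```

Note that the second record in this sample is shorter and has no landing time, so it splits into five fields rather than six; any code that consumes the result needs to allow for that.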

If your data has multiple records of variable length on a single line, or if there may or may not be spaces between each field, the parsing becomes more complex, but should still be fairly easy to do in Python. In this case please post a more complete example with several records and show the output you want to get.
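
Without a fuller sample, one possible approach for that harder case is a regular expression that describes the shape of each field rather than relying on spaces or positions. The pattern below is a guess based only on the two samples in this thread: it assumes a two-letter travel type, a flight number of one to four digits, a date written as two digits plus three letters, two three-letter airport codes, and four-digit times with the landing time optional. The name `ITIN_RE` and the group names are illustrative.

```python
import re

# Assumed record layout; adjust the pattern if the real data differs.
ITIN_RE = re.compile(r"""
    (?P<travel_type>[A-Z]{2})   \s*
    (?P<flight_no>\d{1,4})      \s*
    (?P<date>\d{2}[A-Z]{3})     \s*
    (?P<origin>[A-Z]{3})
    (?P<destination>[A-Z]{3})   \s*
    (?P<boarding>\d{4})
    (?:\s*(?P<landing>\d{4}))?  # landing time may be missing
""", re.VERBOSE)

itinline = 'QR 529  09AUG MAADOH  0405  0600QR  67  09AUG DOHFRA  0815'
records = [m.groupdict() for m in ITIN_RE.finditer(itinline)]
for record in records:
    print(record)
```

Fields that are absent, such as the missing landing time in the second record, come back as `None`. Text that does not fit the pattern is silently skipped by `finditer`, so it is worth checking the number of matches against what you expect.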

Answer 1
1 2015-09-22 12:15:20