How can I extract two specific numbers from multiple lines of a text file in Python?

I have a very large text file that contains, say, latitude measurements from 2 GPS antennas. There is a lot of garbage data in the file, and I need to extract the latitude measurements from it. They appear only occasionally, in between lines of unrelated text. The line in which they occur looks like:

12:34:56.789    78:90:12.123123123  BLAH_BLAH   blahblah    :      LAT #1 MEAS=-80[deg], LAT #2 MEAS=-110[deg]  blah_BlHaBKBjFkjsa.c

The numbers that I need are the values in "LAT #1 MEAS=-80[deg]" and "LAT #2 MEAS=-110[deg]". So, basically, -80 and -110.

The remaining text is not important to me.

Here is a sample of the input file:

08:59:07.603    08:59:05.798816 PAL_PARR_INTF   TraceModule GET int@HISR :82    drv_Shm.c (../../../PALCommon/Platform_EV/HAL/Common/driver/Shm/src)    525 
08:59:07.603    08:59:05.798816 PAL_PARR_INTF   TraceModule xdma is not running drv_Shm.c (../../../PALCommon/Platform_EV/HAL/Common/driver/Shm/src)    316 
08:59:07.603    08:59:05.798847 PAL_PARR_INTF   TraceModule DMA is activated    drv_Shm.c (../../../PALCommon/Platform_EV/HAL/Common/driver/Shm/src)    461 
08:59:10.847    08:59:09.588001 UHAL_SRCH   TraceFlow   :      LAT #1 MEAS=-80[deg], LAT #2 MEAS=-110[deg]  uhal_CHmcpPschMultiPath.c (../../../HEDGE/UL1/UHAL_3XX/Searcher/Code/Src)   1596    
08:59:11.440    08:59:10.876819 UHAL_COMMON TraceWarning    cellRtgSlot=0 cellRtgChip=1500 CELLK_ACTIVE=1 boundary RSN 232482 current RSN 232482 boundarySFN 508 currentSFN 508 uhal_Hmcp.c (../../../HEDGE/UL1/UHAL_3XX/platform/Code/Src) 2224    
08:59:11.440    08:59:10.877277 UHAL_SRCH   TraceWarning    uhal_HmcpSearcherS1LISR: status_reg(0xf0100000) uhal_CHmcpPschMultiPath.c (../../../HEDGE/UL1/UHAL_3XX/Searcher/Code/Src)   1497    
08:59:11.440    08:59:10.877307 UHAL_COMMON TraceWarning    uhal_HmcpSearcherSCDLISR is called. uhal_CHmcpPschMultiPath.c (../../../HEDGE/UL1/UHAL_3XX/Searcher/Code/Src)   1512    
08:59:11.440    08:59:10.877338 UHAL_SRCH   TraceFlow   :      LAT #1 MEAS=-78[deg], LAT #2 MEAS=-110[deg]  uhal_CHmcpPschMultiPath.c (../../../HEDGE/UL1/UHAL_3XX/Searcher/Code/Src)   1596    

I am using the code below to open the file and get these values, but it doesn't work. I am new to programming, so I have no idea where I'm going wrong here.

import re    # Importing 're' for using regular expressions

file_dir=raw_input('Enter the complete Directory of the file (eg c:\\abc.txt):')    # Providing the user with a choice to open their file in .txt format
with open(file_dir, 'r') as f:
    lat_lines= f.read()                                                            # storing the data in a variable

# Declaring the two lists to hold the numbers
raw_lat1 = []
raw_lat2 = []

start_1 = 'LAT #1 MEAS='
end_1 = '[de'

start_2 = 'LAT #2 MEAS='
end_2 = '[de'

x = re.findall(r'start_1(.*?)end_1',lat_lines,re.DOTALL)
raw_lat1.append(x)

y = re.findall(r'start_2(.*?)end_2',lat_lines,re.DOTALL)
raw_lat2.append(y)

This should do it (it doesn't use a regex, but it'll still work):

answer = []
with open('file.txt') as infile:
    for line in infile:
        if "LAT #1 MEAS=" not in line: continue
        if "LAT #2 MEAS=" not in line: continue
        splits = line.split('=')
        temp = [0,0]
        for i,part in enumerate(splits):
            if part.endswith("LAT #1 MEAS"): temp[0] = int(splits[i+1].split(None,1)[0].split('[',1)[0])
            elif part.endswith("LAT #2 MEAS"): temp[1] = int(splits[i+1].split(None,1)[0].split('[',1)[0])
        answer.append(temp)

There are a couple of problems with the regex that I can see from here. In your re.findall call, you're using start_1 and end_1 as if they were variables, but inside the string literal they are just the raw characters "start_1" and "end_1", so the regular expression searches for that literal text. To use the variables in the regular expression string, you would have to use format strings instead. Example:

r'%s(.*?)%s' % (start_1, end_1)
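As a minimal sketch of that substitution, using the start_1 and end_1 values from the question: the re.escape calls are an addition not mentioned in the answer, included on the assumption that the delimiters should be matched literally (end_1 contains a "["). The sample line is the one from the question with its whitespace shortened.

```python
import re

# Delimiters exactly as defined in the question.
start_1 = 'LAT #1 MEAS='
end_1 = '[de'

# Sample line from the question (whitespace shortened).
line = ("12:34:56.789 78:90:12.123123123 BLAH_BLAH blahblah : "
        "LAT #1 MEAS=-80[deg], LAT #2 MEAS=-110[deg] blah_BlHaBKBjFkjsa.c")

# re.escape makes special characters such as '[' literal before they are
# substituted into the pattern.
pattern = r'%s(.*?)%s' % (re.escape(start_1), re.escape(end_1))

lat1 = re.findall(pattern, line)
print(lat1)  # ['-80']
```

Note that findall returns strings, so each value still needs int() before it can be used as a number.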

Also, a greedy .* before end_1 matches any character, so it consumes everything up to the final occurrence of end_1 on the line (or in the whole file, since you pass re.DOTALL). Both LAT #1 and LAT #2 end the same way, so if everything else about the pattern were correct, the match would run on through "-80[deg], LAT #2 MEAS=-110[de". The lazy .*? in your pattern avoids this by stopping at the first end_1.
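The difference between a greedy and a lazy quantifier can be checked on the sample line. This is only an illustrative sketch: the patterns are written out by hand with the bracket escaped, and the line is the question's sample with its whitespace shortened.

```python
import re

# Sample line from the question (whitespace shortened).
line = ("12:34:56.789 78:90:12.123123123 BLAH_BLAH blahblah : "
        "LAT #1 MEAS=-80[deg], LAT #2 MEAS=-110[deg] blah_BlHaBKBjFkjsa.c")

# Greedy: .* runs to the LAST "[de" on the line.
greedy = re.findall(r'LAT #1 MEAS=(.*)\[de', line)

# Lazy: .*? stops at the FIRST "[de" after the prefix.
lazy = re.findall(r'LAT #1 MEAS=(.*?)\[de', line)

print(greedy)  # ['-80[deg], LAT #2 MEAS=-110']
print(lazy)    # ['-80']
```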

Additionally, when you want to match literal brackets in a regular expression, you must escape them. Unescaped brackets are used to specify a character set in regexes.
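A small sketch of what the brackets do in each case, using a shortened fragment of the sample line. The variable names are only illustrative.

```python
import re

text = "-80[deg]"

# Unescaped brackets form a character set: any ONE of 'd', 'e' or 'g'.
char_set = re.findall(r'[deg]', text)

# Escaped brackets match the literal text "[deg]".
literal = re.findall(r'\[deg\]', text)

# An unclosed '[' (as in the question's end_1 = '[de') is not a valid pattern.
try:
    re.compile('[de')
    unclosed_ok = True
except re.error:
    unclosed_ok = False

print(char_set)     # ['d', 'e', 'g']
print(literal)      # ['[deg]']
print(unclosed_ok)  # False
```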

Here's an example where I just assume the variable line contains your sample string "12:34:56.789 78:90:12.123123123 BLAH_BLAH blahblah : LAT #1 MEAS=-80[deg], LAT #2 MEAS=-110[deg] blah_BlHaBKBjFkjsa.c". You might need to adjust this snippet for your whole file.

import re

prefix = r'LAT %s MEAS=(-?\d+)\[deg\]'  # %s is the format placeholder for the variable part of the expression
p1 = r'#1'
p2 = r'#2'
x = re.findall(prefix % p1, line)  # ['-80'] for the sample string
y = re.findall(prefix % p2, line)  # ['-110'] for the sample string
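One possible way to adapt this to a whole file is sketched below. The function name extract_latitudes and the single combined pattern are assumptions for illustration, not part of the answer above. The function takes any iterable of lines, so an open file object can be passed in directly. The sample lines are taken from the question, with the trailing source-path columns left out.

```python
import re

# Captures both values in one pass; \s* tolerates spacing differences
# between the two measurements.
LAT_PATTERN = re.compile(
    r'LAT #1 MEAS=(-?\d+)\[deg\],\s*LAT #2 MEAS=(-?\d+)\[deg\]'
)

def extract_latitudes(lines):
    """Return two lists of ints (LAT #1, LAT #2) from an iterable of lines."""
    lat1, lat2 = [], []
    for line in lines:
        match = LAT_PATTERN.search(line)
        if match is None:
            continue  # skip lines that carry no measurement
        lat1.append(int(match.group(1)))
        lat2.append(int(match.group(2)))
    return lat1, lat2

sample = [
    "08:59:07.603    08:59:05.798847 PAL_PARR_INTF   TraceModule DMA is activated    drv_Shm.c",
    "08:59:10.847    08:59:09.588001 UHAL_SRCH   TraceFlow   :      LAT #1 MEAS=-80[deg], LAT #2 MEAS=-110[deg]  uhal_CHmcpPschMultiPath.c",
    "08:59:11.440    08:59:10.877338 UHAL_SRCH   TraceFlow   :      LAT #1 MEAS=-78[deg], LAT #2 MEAS=-110[deg]  uhal_CHmcpPschMultiPath.c",
]

raw_lat1, raw_lat2 = extract_latitudes(sample)
print(raw_lat1)  # [-80, -78]
print(raw_lat2)  # [-110, -110]
```

For a real file, the same call would be made inside a with open(file_dir) as f: block, passing f instead of sample. Reading line by line also avoids loading a very large file into memory at once.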
