Pythonic way to extract values from this text file

Question

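I have an output file from a piece of legacy software, shown below. I want to extract values from it so that, for example, I can set a variable called direct_solar_irradiance to 648.957 and target ground pressure to 1013.00.

So far I have been pulling out individual lines and processing them as below (repeated many times for the different values I want to extract):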

values = lines[97].split()
self.irradiance_direct, self.irradiance_diffuse, self.irradiance_env = values

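However, I have now found that when certain parameters are selected, extra lines are added in the middle of the output. This means, of course, that line 97 will no longer hold the values I need.

Given that extra lines may be added to the output in some cases, is there a nice Pythonic way to extract these values? I guess I need to search the file for known pieces of text and then extract the numbers they refer to, but the only ways I can think of doing that are very clunky.

So:

Is there a nice Pythonic way to search for these strings and extract the values I want?

If not, is there another sensible way to do this? (For example, some cool text-file parsing library that I know nothing about.)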

******************************* 6sV version 1.0B ******************************
*                                                                             *
*   geometrical conditions identity *
*   ------------------------------- *
*   user defined conditions *
*                                                                             *
*   month: 14 day : 1 *
*   solar zenith angle: 10.00 deg   solar azimuthal angle: 20.00 deg *
*   view zenith angle: 30.00 deg   view azimuthal angle: 40.00 deg *
*   scattering angle: 159.14 deg   azimuthal angle difference: 20.00 deg *
*                                                                             *
*   atmospheric model description *
*   ----------------------------- *
*   atmospheric model identity : *
*   midlatitude summer (uh2o=2.93g/cm2,uo3=.319cm-atm) *
*   aerosols type identity : *
*   Maritime aerosol model *
*   optical condition identity : *
*   visibility : 8.49 km   opt. thick. 550 nm : 0.5000 *
*                                                                             *
*   spectral condition *
*   ------------------ *
*   monochromatic calculation at wl 0.400 micron *
*                                                                             *
*   Surface polarization parameters *
*   ---------------------------------- *
*                                                                             *
*                                                                             *
*   Surface Polarization Q,U,Rop,Chi   0.00000   0.00000   0.00000   0.00 *
*                                                                             *
*                                                                             *
*   target type *
*   ----------- *
*   homogeneous ground *
*   monochromatic reflectance 1.000 *
*                                                                             *
*   target elevation description *
*   ---------------------------- *
*   ground pressure [mb]   1013.00 *
*   ground altitude [km]   0.000 *
*                                                                             *
*   plane simulation description *
*   ---------------------------- *
*   plane pressure [mb]   1013.00 *
*   plane altitude absolute [km]   0.000 *
*   atmosphere under plane description: *
*   ozone content   0.000 *
*   h2o content   0.000 *
*   aerosol opt. thick. 550nm   0.000 *
*                                                                             *
*   atmospheric correction activated *
*   -------------------------------- *
*   BRDF coupling correction *
*   input apparent reflectance : 0.500 *
*                                                                             *
*******************************************************************************
*******************************************************************************
*                                                                             *
*   integrated values of : *
*   -------------------- *
*                                                                             *
*   apparent reflectance 1.1287696   appar. rad.(w/m2/sr/mic) 588.646 *
*   total gaseous transmittance 1.000 *
*                                                                             *
*******************************************************************************
*                                                                             *
*   coupling aerosol -wv : *
*   -------------------- *
*   wv above aerosol : 1.129   wv mixed with aerosol : 1.129 *
*   wv under aerosol : 1.129 *
*******************************************************************************
*                                                                             *
*   integrated values of : *
*   -------------------- *
*                                                                             *
*   app. polarized refl. 0.0000   app. pol. rad. (w/m2/sr/mic) 0.000 *
*   direction of the plane of polarization 0.00 *
*   total polarization ratio 0.000 *
*                                                                             *
*******************************************************************************
*                                                                             *
*   int. normalized values of : *
*   --------------------------- *
*   % of irradiance at ground level *
*   % of direct irr.    % of diffuse irr.    % of enviro. irr *
*   0.351               0.354                0.295 *
*   reflectance at satellite level *
*   atm. intrin. ref.   background ref.      pixel reflectance *
*   0.000               0.000                1.129 *
*                                                                             *
*   int. absolute values of *
*   ----------------------- *
*   irr. at ground level (w/m2/mic) *
*   direct solar irr.    atm. diffuse irr.    environment  irr *
*   648.957              655.412              544.918 *
*   rad at satel. level (w/m2/sr/mic) *
*   atm. intrin. rad.    background rad.      pixel radiance *
*   0.000                0.000                588.646 *
*                                                                             *
*                                                                             *
*   sol. spect (in w/m2/mic) *
*   1663.594 *
*                                                                             *
*******************************************************************************
*******************************************************************************
*                                                                             *
*   integrated values of : *
*   -------------------- *
*                                                                             *
*                          downward    upward     total *
*   global gas. trans. :   1.00000     1.00000    1.00000 *
*   water   "      "   :   1.00000     1.00000    1.00000 *
*   ozone   "      "   :   1.00000     1.00000    1.00000 *
*   co2     "      "   :   1.00000     1.00000    1.00000 *
*   oxyg    "      "   :   1.00000     1.00000    1.00000 *
*   no2     "      "   :   1.00000     1.00000    1.00000 *
*   ch4     "      "   :   1.00000     1.00000    1.00000 *
*   co      "      "   :   1.00000     1.00000    1.00000 *
*                                                                             *
*                                                                             *
*   rayl. sca. trans.  :   0.84422     1.00000    0.84422 *
*   aeros. sca.   "    :   0.94572     1.00000    0.94572 *
*   total sca.    "    :   0.79616     1.00000    0.79616 *
*                                                                             *
*                                                                             *
*                                                                             *
*                          rayleigh    aerosols   total *
*                                                                             *
*   spherical albedo   :   0.23410     0.12354    0.29466 *
*   optical depth total:   0.36193     0.55006    0.91199 *
*   optical depth plane:   0.00000     0.00000    0.00000 *
*   reflectance I      :   0.00000     0.00000    0.00000 *
*   reflectance Q      :   0.00000     0.00000    0.00000 *
*   reflectance U      :   0.00000     0.00000    0.00000 *
*   polarized reflect. :   0.00000     0.00000    0.00000 *
*   degree of polar.   :   nan         0.00       nan *
*   dir. plane polar.  :   -45.00      -45.00     -45.00 *
*   phase function I   :   1.38819     0.27621    0.71751 *
*   phase function Q   :   -0.09117    -0.00856   -0.04134 *
*   phase function U   :   -1.34383    0.02142    -0.52039 *
*   primary deg. of pol:   -0.06567    -0.03099   -0.05762 *
*   sing. scat. albedo :   1.00000     0.98774    0.99261 *
*                                                                             *
*                                                                             *
*******************************************************************************
*******************************************************************************
*******************************************************************************
*   atmospheric correction result *
*   ----------------------------- *
*   input apparent reflectance : 0.500 *
*   measured radiance [w/m2/sr/mic] : 260.747 *
*   atmospherically corrected reflectance *
*   Lambertian case : 0.52995 *
*   BRDF case : 0.52995 *
*   coefficients xa xb xc : 0.00241 0.00000 0.29466 *
*   y=xa*(measured radiance)-xb; acr=y/(1.+xc*y) *

Answer 1

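A more complete, and probably more robust, solution would use a parser with a custom grammar (pyparsing) or some kind of FSM-based processor (TextFSM).

Both of those options take non-trivial effort to apply to this output. A (probably) lighter solution is to identify each line by a known label and then extract from it as appropriate (as other posters have shown).

There are several ways to implement this. I would suggest mapping 'extractor' callables to the known line labels, then iterating over the lines and calling the extractor that matches. Each callable takes the line and a context object/dict as arguments and adds attributes to the context as needed. Something like https://gist.github.com/1035938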
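As a rough illustration of the label-to-extractor idea (this is not the code from the gist; the function names, the `EXTRACTORS` table, and the `last_number` helper are all made up for this sketch), it might look something like this:

```python
import re

NUMBER = re.compile(r"[-+]?\d*\.?\d+")


def last_number(line):
    """Return the last number on a line as a float.

    The last one is used because units such as 'w/m2' also contain digits.
    """
    return float(NUMBER.findall(line)[-1])


def ground_pressure(line, context):
    context["ground_pressure"] = last_number(line)


def ground_altitude(line, context):
    context["ground_altitude"] = last_number(line)


def measured_radiance(line, context):
    context["measured_radiance"] = last_number(line)


# Known line label -> extractor callable (hypothetical table for this sketch).
EXTRACTORS = {
    "ground pressure": ground_pressure,
    "ground altitude": ground_altitude,
    "measured radiance": measured_radiance,
}


def parse(lines):
    """Run every extractor whose label appears in a line; return the context."""
    context = {}
    for line in lines:
        for label, extractor in EXTRACTORS.items():
            if label in line:
                extractor(line, context)
    return context


sample = [
    "* ground pressure [mb] 1013.00 *",
    "* ground altitude [km] 0.000 *",
    "* plane pressure [mb] 1013.00 *",
    "* measured radiance [w/m2/sr/mic] : 260.747 *",
]
result = parse(sample)
print(result)
```

Because lines are found by label rather than by position, extra lines appearing in the middle of the output should not matter. Supporting a new value means adding one function and one table entry.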

Answer 2

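You could roll your own mini-language to automate the extraction. I did the following to automatically parse the output of a proprietary program: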

# will match in the order written here
tokens = ["num_ref_frames", "Max QP", "Min QP", "Avg QP", "I4x4",
          "I16x16", "SkipZero", "SkipMV", "16x16", "16x8", "8x16",
          "8x8", "8x4", "4x8", "4x4"]

special = ["Quarterpel MVs"]

# this dictionary (hash table) maps each search string from the tokens list
# to a list whose first element is the index of the field to extract when
# building the result matrix, e.g. 0 = 1st field, 1 = 2nd field, 2 = 3rd field etc.
# (named `fields` rather than `dict` to avoid shadowing the built-in type)
fields = {tokens[0]:  [1], tokens[1]:  [1], tokens[2]:  [1], tokens[3]:  [1],
          tokens[4]:  [2], tokens[5]:  [2], tokens[6]:  [2], tokens[7]:  [2],
          tokens[8]:  [2], tokens[9]:  [2], tokens[10]: [2], tokens[11]: [2],
          tokens[12]: [2], tokens[13]: [2], tokens[14]: [2],}

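Then I simply loop over the input and check each line for one of the tokens; if there is a match, I split the line and pick out the right field according to the matching dictionary entry.

The special list above is for handling, well, special variables that need to be read from multiple lines.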
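The loop itself is not shown in this answer, so the following is only a guess at its shape. The input lines, the `:` delimiter, and the `extract` function are invented for illustration, and the token table is shortened:

```python
# Hypothetical, shortened versions of the tables above.
tokens = ["Max QP", "I16x16", "16x16"]
fields = {"Max QP": [1], "I16x16": [2], "16x16": [2]}


def extract(lines, delimiter=":"):
    """Collect the configured field for the first token found on each line."""
    found = {}
    for line in lines:
        parts = [part.strip() for part in line.split(delimiter)]
        for token in tokens:  # order matters: "I16x16" is tested before "16x16"
            if token in line:
                index = fields[token][0]
                if index < len(parts):
                    found[token] = parts[index]
                break  # stop at the first matching token
    return found


# Invented input lines; the real program output is not shown in the answer.
sample = [
    "Max QP: 38",
    "I16x16: 120: 4.7%",
    "16x16: 300: 11.8%",
    "unrelated line",
]
result = extract(sample)
print(result)
```

Stopping at the first match is what makes the token order significant: "16x16" is a substring of "I16x16", so the longer label has to be tested first.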

Update

Clone git://gist.github.com/1037403.git to get a copy of the code

usage:
./parser.py all_dec.txt

Hope that helps!

Answer 3

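Well, if you want a general-purpose parsing library there is pyparsing, but it is probably overkill in this case.

This looks like a fairly line-oriented text file, and it is not very big, so the best approach is to go through it line by line, looking for text that identifies what you are after.

So something like: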


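with open('file.txt', 'r') as f:
    lines = f.readlines()  # read into a list so that lines[n + 1] can be indexed

for n, line in enumerate(lines):
    if 'direct solar irr.    atm. diffuse irr.    environment  irr' in line:
        # the values are on the line right after this header;
        # strip the '*' border characters before splitting
        values = lines[n + 1].strip('* \n').split()
        self.irradiance_direct, self.irradiance_diffuse, self.irradiance_env = values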

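You can then add more if statements and so on, as needed, to pick up the other data. If you have a lot of values to extract, though, you will probably want to generalize the code a bit (perhaps a dictionary holding the text to match as the key and a function to call when that key matches).

You may also want to match the line with a regular expression, so that varying amounts of whitespace are handled better. Otherwise a single space too many or too few will throw it off.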
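One possible way to generalize, sketched here with a table that maps header text to field names rather than to functions (a simplification of the suggestion above). The `HEADERS` table, the `radiance_*` names, and the helper functions are assumptions made for this example:

```python
# Header text (whitespace-normalised) -> names for the values on the next line.
HEADERS = {
    "direct solar irr. atm. diffuse irr. environment irr": (
        "irradiance_direct", "irradiance_diffuse", "irradiance_env"),
    "atm. intrin. rad. background rad. pixel radiance": (
        "radiance_atm", "radiance_background", "radiance_pixel"),
}


def clean(line):
    """Drop the '*' border and collapse runs of whitespace."""
    return " ".join(line.strip("* \n").split())


def parse(lines):
    """Return {name: value} for every known header followed by a value line."""
    values = {}
    for n, line in enumerate(lines[:-1]):
        names = HEADERS.get(clean(line))
        if names:
            numbers = clean(lines[n + 1]).split()
            values.update(zip(names, map(float, numbers)))
    return values


sample = [
    "*       direct solar irr.    atm. diffuse irr.    environment  irr     *",
    "*            648.957               655.412               544.918       *",
    "*       atm. intrin. rad.    background rad.    pixel radiance         *",
    "*              0.000                 0.000               588.646       *",
]
result = parse(sample)
print(result)
```

Normalising the whitespace before the lookup keeps the table readable and makes the match insensitive to column spacing.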
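A minimal sketch of the regular-expression variant, assuming the header wording seen in the sample output; the `irradiance` function and the pattern names are made up for this example:

```python
import re

# \s+ between the words tolerates any amount of spacing in the header line.
HEADER = re.compile(
    r"direct\s+solar\s+irr\.\s+atm\.\s+diffuse\s+irr\.\s+environment\s+irr")
NUMBER = re.compile(r"[-+]?\d+\.\d+")


def irradiance(lines):
    """Return (direct, diffuse, environment), or None if the header is missing."""
    for n, line in enumerate(lines[:-1]):
        if HEADER.search(line):
            return tuple(float(x) for x in NUMBER.findall(lines[n + 1]))
    return None


sample = [
    "* direct solar irr.  atm. diffuse irr.   environment irr *",
    "*   648.957   655.412   544.918 *",
]
result = irradiance(sample)
print(result)
```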

Answer 4

The best way, IMHO, would be to use an mmapped file and then use regular expressions to find what you are looking for.

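 import mmap, re
 with open('file.txt', 'rb') as f:
     text = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # length 0 maps the whole file
     match = re.search(pattern, text)  # pattern must be a bytes pattern, e.g. rb'...'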

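The mmap module maps the file into memory as a string-like object (a bytes-like object in Python 3), so you can run regular expressions and most string-style operations on it. Regular expressions are the best way to search for something. Simple and efficient.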
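A self-contained sketch of that approach follows. It writes a one-line stand-in file first so that it can run on its own; the file contents and the `RADIANCE` pattern are assumptions based on the sample output in the question:

```python
import mmap
import os
import re
import tempfile

# Write a tiny stand-in file so the example is self-contained.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"* measured radiance [w/m2/sr/mic] : 260.747 *\n")
    path = tmp.name

# mmap exposes bytes, so the pattern has to be a bytes pattern (rb"...").
RADIANCE = re.compile(
    rb"measured radiance\s+\[w/m2/sr/mic\]\s*:\s*([-+]?\d+\.\d+)")

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as text:
        match = RADIANCE.search(text)
        # Decode the captured bytes before converting to a float.
        radiance = float(match.group(1).decode("ascii")) if match else None

os.remove(path)
print(radiance)
```

For a file of this size, reading everything into a string with `read()` would work just as well; mapping mainly pays off for large files.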

Answer 5

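If you only need to find particular lines, just treat everything as one string and run specific regular expressions to dig out your gems.

If you need to extract more data, I believe that with a small amount of work you can create a nice parser for your data. I would start with the following functions: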
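For instance, with the whole file held in one string, two targeted searches might look like the following. The patterns are guesses based on the sample output, and `text` is a shortened stand-in for the real file contents:

```python
import re

text = """\
* ground pressure [mb] 1013.00 *
* ground altitude [km] 0.000 *
* direct solar irr. atm. diffuse irr. environment irr *
* 648.957 655.412 544.918 *
"""

# A labelled value that sits on the same line as its label.
pressure = float(re.search(r"ground pressure\s+\[mb\]\s+(\S+)", text).group(1))

# Three values on the line after a header. \s also matches newlines, so the
# pattern can step over the '*' borders onto the next line.
m = re.search(r"environment\s+irr\s*\*\s*\*\s*(\S+)\s+(\S+)\s+(\S+)", text)
direct, diffuse, env = map(float, m.groups())

print(pressure, direct, diffuse, env)
```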

def extract_screens(text):
    """ 
    Returns a list of screens (divided by asterisks).
    Each screen is a list of strings stripped from asterisks.
    """
    ...

def process_screen(screen):
    """ 
    Returns a list of screen divisions as tuples: [(heading, body)...]
    heading is a string, body is a list of strings
    blank lines are filtered out.
    """
    ...

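By this point you should have an indexed list of text sections. You can iterate over them and run a simple, dedicated parser method for each section.

Tip: use unit tests to keep your sanity.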
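The two function bodies are left open in the answer. One possible way to fill them in is sketched below, assuming that a screen ends at a line starting with at least ten asterisks and that a heading is any line followed by a row of dashes; both rules are guesses from the sample output:

```python
def extract_screens(text):
    """Split the text into screens at banner lines made of asterisks."""
    screens, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("*" * 10):  # banner line: close the current screen
            if current:
                screens.append(current)
            current = []
        else:
            current.append(line.strip("*").strip())
    if current:
        screens.append(current)
    return screens


def process_screen(screen):
    """Return [(heading, body), ...]; lines before the first heading are dropped."""
    lines = [line for line in screen if line]  # filter out blank lines
    divisions = []
    for n, line in enumerate(lines):
        if set(line) == {"-"}:
            continue  # underline row, not content
        is_heading = n + 1 < len(lines) and set(lines[n + 1]) == {"-"}
        if is_heading:
            divisions.append((line, []))
        elif divisions:
            divisions[-1][1].append(line)
    return divisions


text = """\
******************************* 6sV version 1.0B ******************************
*                                          *
* target elevation description *
* ---------------------------- *
* ground pressure [mb] 1013.00 *
* ground altitude [km] 0.000 *
*                                          *
*******************************************************************************
* integrated values of : *
* -------------------- *
* total gaseous transmittance 1.000 *
*******************************************************************************
"""
screens = extract_screens(text)
for screen in screens:
    print(process_screen(screen))
```

Each (heading, body) pair can then be handed to a small parser dedicated to that heading, which keeps the per-section logic short and easy to test.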

5 answers

Answer 1
3 2011-06-20 16:28:30

Answer 2
2 accepted 2011-06-20 15:25:16

Answer 3
1 2011-06-20 15:47:43

Answer 4
0 2011-06-20 16:00:11

Answer 5
0 2011-06-20 17:05:48
