Reading a text file with pandas to get specific rows

Question

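I am trying to read a text log file into pandas with the read_csv method, and I need to pick up the line that comes before each ---- line in the file. I defined column names only so that I could fetch the data by column, but I have not found a way to achieve this.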

My raw log data:

myserer143
-------------------------------
Stopping Symantec Management Agent for UNIX, Linux and Mac: [  OK  ]
This will remove the Symantec Management Agent for UNIX, Linux and Mac software from your system.

Are you sure you want to continue [Yy/Nn]?

Uninstalling dependant solutions...
Unregistering the Altiris Base Task Handlers for UNIX, Linux and Mac sub-agent...
Unregistering the Script Task Plugin...
Unregistering the Power Control Task Plugin...
Unregistering the Service Control Task Plugin...
Unregistering the Web Service Task Plugin...
Unregistering the Reset Task Agent Task Plugin...
Unregistering the Agent Control Task Plugin...
Unregistering solution...
Unregistering the SMF cli plug-in...
Unregistering the Software Management Framework Agent sub-agent...
Removing wrapper scripts and links for applications...
Unregistering the Software Management Framework Agent Plugins...
Removing wrapper scripts and links for applications...
Unregistering solution...
Unregistering the CTA cli plug-in...
Unregistering the Client Task Scheduling sub-agent...
Unregistering the Altiris Client Task Agent for UNIX, Linux and Mac sub-agent...
Remove the wrapper script and link for the Task Util application...
Unregistering the Altiris Client Task Agent for UNIX, Linux and Mac Plugin...
Unregistering the Client Task Scheduling Plugin...
Unregistering the Alert User Task Plugin...
Unregistering the shared library...
Unregistering solution...
Unregistering the Inventory Rule Agent...
Removing wrapper scripts and links for applications...
Unregistering the Inventory Rule Agent Plugin...
Removing wrapper scripts and links for applications...
Unregistering solution...
Uninstalling dependant solutions finished.

Removing Symantec Management Agent for UNIX, Linux and Mac package from the system...
Removing wrapper scripts and links for applications...
Stopping Symantec Management Agent for UNIX, Linux and Mac: [  OK  ]
Remove non packaged files.
Symantec Management Agent for UNIX, Linux and Mac Configuration utility.
  Removing aex-* links in /usr/bin
  Removing RC init links and scripts
Cleaning up after final package removal.
Removal finished.

Uninstallation has finished.
dbserer144
-------------------------------
Stopping Symantec Management Agent for UNIX, Linux and Mac: [  OK  ]
This will remove the Symantec Management Agent for UNIX, Linux and Mac software from your system.

Are you sure you want to continue [Yy/Nn]?

Uninstalling dependant solutions...
Unregistering the Altiris Base Task Handlers for UNIX, Linux and Mac sub-agent...
Unregistering the Script Task Plugin...
Unregistering the Power Control Task Plugin...
Unregistering the Service Control Task Plugin...
Unregistering the Web Service Task Plugin...
Unregistering the Reset Task Agent Task Plugin...
Unregistering the Agent Control Task Plugin...
Unregistering solution...
Unregistering the SMF cli plug-in...
Unregistering the Software Management Framework Agent sub-agent...
Removing wrapper scripts and links for applications...
Unregistering the Software Management Framework Agent Plugins...
Removing wrapper scripts and links for applications...
Unregistering solution...
Unregistering the CTA cli plug-in...
Unregistering the Client Task Scheduling sub-agent...
Unregistering the Altiris Client Task Agent for UNIX, Linux and Mac sub-agent...
Remove the wrapper script and link for the Task Util application...
Unregistering the Altiris Client Task Agent for UNIX, Linux and Mac Plugin...
Unregistering the Client Task Scheduling Plugin...
Unregistering the Alert User Task Plugin...
Unregistering the shared library...
Unregistering solution...
Unregistering the Inventory Rule Agent...
Removing wrapper scripts and links for applications...
Unregistering the Inventory Rule Agent Plugin...
Removing wrapper scripts and links for applications...
Unregistering solution...
Uninstalling dependant solutions finished.
Removing Symantec Management Agent for UNIX, Linux and Mac package from the system...
Removing wrapper scripts and links for applications...
Stopping Symantec Management Agent for UNIX, Linux and Mac: [  OK  ]
Remove non packaged files.
Symantec Management Agent for UNIX, Linux and Mac Configuration utility.
  Removing aex-* links in /usr/bin
  Removing RC init links and scripts
Cleaning up after final package removal.
Removal finished.

Uninstallation has finished.

The DataFrame looks like this:

>>> data = pd.read_csv("alt_1.logs", sep='delimiter', names=["a", "b", "c"], engine="python")
>>> data
                                                       a   b   c
0                                              myserer143 NaN NaN
1                        ------------------------------- NaN NaN
2      Stopping Symantec Management Agent for UNIX, L... NaN NaN
3      This will remove the Symantec Management Agent... NaN NaN
4             Are you sure you want to continue [Yy/Nn]? NaN NaN
5                    Uninstalling dependant solutions... NaN NaN
6      Unregistering the Altiris Base Task Handlers f... NaN NaN
7                Unregistering the Script Task Plugin... NaN NaN
8         Unregistering the Power Control Task Plugin... NaN NaN
9       Unregistering the Service Control Task Plugin... NaN NaN

Expected result:

myserer143
dbserer144

Or, if it is feasible:

myserer143 Uninstallation has finished
dbserer144 Uninstallation has finished

Answer 1

Use shift together with startswith to build boolean masks, then filter with boolean indexing:

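import pandas as pd

data = pd.read_csv("alt_1.logs", sep='delimiter', names=["a"], engine="python")

# m1 marks the row directly above each '----' line (the server name)
m1 = data['a'].shift(-1).str.startswith('----', na=False)
# m2 marks the row two above each '----' line (the last line of the previous block)
m2 = data['a'].shift(-2).str.startswith('----', na=False)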

Filter the rows and append the last row of the DataFrame:

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
data = pd.concat([data[m1 | m2], data.iloc[[-1]]])
print (data)
                               a
0                     myserer143
44  Uninstallation has finished.
45                    dbserer144
89  Uninstallation has finished.

Reshape the values and join the text together:

df = pd.DataFrame(data.values.reshape(-1,2)).apply(' '.join, 1).to_frame('data')
print (df)
                                      data
0  myserer143 Uninstallation has finished.
1  dbserer144 Uninstallation has finished.
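
To see how the two shifted masks line up, here is a minimal sketch of the same idea on an in-memory log. The `sample` text and the `host`/`status` column names are made up for illustration, and the lines are loaded into a Series directly instead of going through read_csv:

```python
import pandas as pd

# Hypothetical miniature log standing in for alt_1.logs
sample = """\
host1
-------
step one...

Uninstallation has finished.
host2
-------
step one...
step two...
Uninstallation has finished.
"""

# Drop blank lines, as read_csv does by default
lines = pd.Series([ln for ln in sample.splitlines() if ln.strip()], name="a")

is_delim = lines.str.startswith("----")
# Shifting the delimiter mask up by one row flags the host names
m1 = is_delim.shift(-1, fill_value=False)
# Shifting it up by two rows flags the last line of each earlier block
m2 = is_delim.shift(-2, fill_value=False)

# The final block has no delimiter after it, so its last line is added by hand
picked = pd.concat([lines[m1 | m2], lines.iloc[[-1]]])

# picked alternates host, status, host, status, so pair the values up row by row
result = pd.DataFrame(picked.to_numpy().reshape(-1, 2), columns=["host", "status"])
print(result)
```

The reshape only works if every block contributes exactly one host line and one closing line, which is the case when each block has at least one line after its delimiter.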

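Edit:

For better performance, or when working with large files, loop over the lines and collect them into a list, build a list of dictionaries from it, and create the DataFrame from that. Finally, shift the column and fill in the last value: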

import pandas as pd

L = []
with open('alt_1.logs', 'r') as f:
    for line in f:
        line = line.strip()
        if line:
            L.append(line)
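# Prepend a copy of the last line so that L[i-2] also exists for the first '---'
L = L[-1:] + L

# For each '---' line: 'a' is the server name, 'b' the last line of the previous block
out = [{'a':L[i-1], 'b':L[i-2]} for i, x in enumerate(L) if x.startswith('---') ]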
print (out)
[{'a': 'myserer143', 'b': 'Uninstallation has finished.'}, 
 {'a': 'dbserer144', 'b': 'Uninstallation has finished.'}]

df = pd.DataFrame(out)
# Move each 'b' up one row; the last row takes the value that wrapped around to row 0
df['b'] = df['b'].shift(-1).fillna(df.loc[0,'b'])
df = df.apply(' '.join, 1).to_frame('data')
print (df)
                                      data
0  myserer143 Uninstallation has finished.
1  dbserer144 Uninstallation has finished.
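
Because every block in the question ends with the same sentence, it is hard to tell from the output whether the shift pairs the right lines. The sketch below repeats the list-based approach on a made-up log whose two blocks end differently ("Uninstallation failed." is a hypothetical line, not something from the real log), so the effect of prepending the last line and shifting is visible:

```python
import pandas as pd

# Hypothetical log: the second block ends with a different line on purpose
sample = """\
host1
-------
step one...
Uninstallation has finished.
host2
-------
step one...
Uninstallation failed.
"""

L = [ln.strip() for ln in sample.splitlines() if ln.strip()]
L = L[-1:] + L  # the file's last line now sits two slots before the first '---'

out = [{"a": L[i - 1], "b": L[i - 2]} for i, x in enumerate(L) if x.startswith("---")]
# At this point each 'b' belongs to the block *before* its host:
# row 0 holds the last line of the file, row 1 the last line of host1's block
before = [row["b"] for row in out]

df = pd.DataFrame(out)
# Shift 'b' up by one row and give the final host the value that started in row 0
df["b"] = df["b"].shift(-1).fillna(df.loc[0, "b"])
print(df)
```

After the shift, host1 is paired with "Uninstallation has finished." and host2 with "Uninstallation failed.", which is the pairing the log actually contains.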

Answer 2

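Since most of the lines in the data are not needed, I think it is better to prepare the data before loading it into a DataFrame.

Judging by the file, the parts you need are always separated by the delimiter '-------...', so it makes sense to look for these lines in a generator and load only the two lines that precede each delimiter.

To do this, we take the first two lines as the starting point and then iterate over the file to pick up the required information.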

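from itertools import tee, islice, zip_longest

import pandas as pd

results = []
n = 2  # number of lines to look back from each delimiter

with open('sample.txt', 'r') as f:
    first = next(f)  # first server name
    delim = next(f)  # the delimiter line that follows it
    results.append(first)

    # lines drives the scan; peek stays at the start so copies of it can be sliced by index
    peek, lines = tee(f)

    for idx, val in enumerate(lines):
        if val == delim:
            # the n lines before this delimiter: end of the previous block, next server name
            for prev in islice(peek.__copy__(), idx - n, idx):
                results.append(prev)
        last = idx

    # the final line of the file closes the last block
    for i in islice(peek.__copy__(), last, last + 1):
        results.append(i)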
results
>> ['myserer143\n',
 'Uninstallation has finished.\n',
 'dbserer144\n',
 'Uninstallation has finished.\n',
 'dbserer144\n',
 'Uninstallation has finished.']

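At this point no unused lines are kept in the results: by offsetting back from each delimiter and grabbing the last line, the returned list contains only the required information. Note, however, that tee buffers every line that peek has not yet consumed, so the scan itself still holds the file contents in memory.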

Then you can group the results into pairs with the grouper recipe from the itertools documentation and load them into a DataFrame.

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

results = [i.strip() for i in results]
data = list(grouper(results, n))

df = pd.DataFrame(data, columns = ['Name','Status'])
df

>>
         Name                        Status
0  myserer143  Uninstallation has finished.
1  dbserer144  Uninstallation has finished.
2  dbserer144  Uninstallation has finished.
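
If memory use during the scan matters, one alternative to tee is a sliding window. The following is only a sketch under stated assumptions, not the method from this answer: it swaps in `collections.deque(maxlen=2)` to remember the two most recent non-blank lines, and it assumes the file starts with a server name and that every block has at least one line after its delimiter. The function name `block_summaries` and the `sample` text are hypothetical.

```python
import io
from collections import deque

import pandas as pd

def block_summaries(lines, delim_prefix="---"):
    """Yield (name, last_line) per block, keeping only two lines in memory."""
    window = deque(maxlen=2)  # the two most recent non-blank, non-delimiter lines
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(delim_prefix):
            if name is not None:
                # window holds [last line of the previous block, next server name]
                yield name, window[0]
            name = window[-1]
        else:
            window.append(line)
    if name is not None:
        # the last block is closed by the end of the file
        yield name, window[-1]

# Hypothetical log; io.StringIO stands in for an open file handle
sample = """\
host1
-------
step one...

Uninstallation has finished.
host2
-------
step one...
Uninstallation failed.
"""

df = pd.DataFrame(list(block_summaries(io.StringIO(sample))), columns=["Name", "Status"])
print(df)
```

Because the generator pairs each name with its status as it goes, no separate grouping step is needed before building the DataFrame.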

Answer 1 (2 votes, accepted, 2018-11-30 06:33:49)

Answer 2 (1 vote, 2018-12-01 03:10:07)
