简体   繁体   English

使用padas读取文本文件以获取特定行

[英]Reading text file with padas to get the specific lines

I'm trying to read the text log file in Pandas with read_csv method, and I have to read every line in the file before ---- , I have defined the columns names just to get the data based on column to make it ease, but i'm not getting the way to achieve this. 我正在尝试使用read_csv方法读取Pandas中的文本日志文件,而且我必须在----之前读取文件中的每一行,我已经定义了列名称,只是为了基于列获取数据,以便于使用,但我没有办法实现这一目标。

My raw log data: 我的原始日志数据:

myserer143
-------------------------------
Stopping Symantec Management Agent for UNIX, Linux and Mac: [  OK  ]
This will remove the Symantec Management Agent for UNIX, Linux and Mac software from your system.

Are you sure you want to continue [Yy/Nn]?

Uninstalling dependant solutions...
Unregistering the Altiris Base Task Handlers for UNIX, Linux and Mac sub-agent...
Unregistering the Script Task Plugin...
Unregistering the Power Control Task Plugin...
Unregistering the Service Control Task Plugin...
Unregistering the Web Service Task Plugin...
Unregistering the Reset Task Agent Task Plugin...
Unregistering the Agent Control Task Plugin...
Unregistering solution...
Unregistering the SMF cli plug-in...
Unregistering the Software Management Framework Agent sub-agent...
Removing wrapper scripts and links for applications...
Unregistering the Software Management Framework Agent Plugins...
Removing wrapper scripts and links for applications...
Unregistering solution...
Unregistering the CTA cli plug-in...
Unregistering the Client Task Scheduling sub-agent...
Unregistering the Altiris Client Task Agent for UNIX, Linux and Mac sub-agent...
Remove the wrapper script and link for the Task Util application...
Unregistering the Altiris Client Task Agent for UNIX, Linux and Mac Plugin...
Unregistering the Client Task Scheduling Plugin...
Unregistering the Alert User Task Plugin...
Unregistering the shared library...
Unregistering solution...
Unregistering the Inventory Rule Agent...
Removing wrapper scripts and links for applications...
Unregistering the Inventory Rule Agent Plugin...
Removing wrapper scripts and links for applications...
Unregistering solution...
Uninstalling dependant solutions finished.

Removing Symantec Management Agent for UNIX, Linux and Mac package from the system...
Removing wrapper scripts and links for applications...
Stopping Symantec Management Agent for UNIX, Linux and Mac: [  OK  ]
Remove non packaged files.
Symantec Management Agent for UNIX, Linux and Mac Configuration utility.
  Removing aex-* links in /usr/bin
  Removing RC init links and scripts
Cleaning up after final package removal.
Removal finished.

Uninstallation has finished.
dbserer144
-------------------------------
Stopping Symantec Management Agent for UNIX, Linux and Mac: [  OK  ]
This will remove the Symantec Management Agent for UNIX, Linux and Mac software from your system.

Are you sure you want to continue [Yy/Nn]?

Uninstalling dependant solutions...
Unregistering the Altiris Base Task Handlers for UNIX, Linux and Mac sub-agent...
Unregistering the Script Task Plugin...
Unregistering the Power Control Task Plugin...
Unregistering the Service Control Task Plugin...
Unregistering the Web Service Task Plugin...
Unregistering the Reset Task Agent Task Plugin...
Unregistering the Agent Control Task Plugin...
Unregistering solution...
Unregistering the SMF cli plug-in...
Unregistering the Software Management Framework Agent sub-agent...
Removing wrapper scripts and links for applications...
Unregistering the Software Management Framework Agent Plugins...
Removing wrapper scripts and links for applications...
Unregistering solution...
Unregistering the CTA cli plug-in...
Unregistering the Client Task Scheduling sub-agent...
Unregistering the Altiris Client Task Agent for UNIX, Linux and Mac sub-agent...
Remove the wrapper script and link for the Task Util application...
Unregistering the Altiris Client Task Agent for UNIX, Linux and Mac Plugin...
Unregistering the Client Task Scheduling Plugin...
Unregistering the Alert User Task Plugin...
Unregistering the shared library...
Unregistering solution...
Unregistering the Inventory Rule Agent...
Removing wrapper scripts and links for applications...
Unregistering the Inventory Rule Agent Plugin...
Removing wrapper scripts and links for applications...
Unregistering solution...
Uninstalling dependant solutions finished.
Removing Symantec Management Agent for UNIX, Linux and Mac package from the system...
Removing wrapper scripts and links for applications...
Stopping Symantec Management Agent for UNIX, Linux and Mac: [  OK  ]
Remove non packaged files.
Symantec Management Agent for UNIX, Linux and Mac Configuration utility.
  Removing aex-* links in /usr/bin
  Removing RC init links and scripts
Cleaning up after final package removal.
Removal finished.

Uninstallation has finished.

DataFrame looks like below: DataFrame如下所示:

>>> data = pd.read_csv("alt_1.logs", sep='delimiter', names=["a", "b", "c"], engine="python")
>>> data
                                                       a   b   c
0                                              myserer143 NaN NaN
1                        ------------------------------- NaN NaN
2      Stopping Symantec Management Agent for UNIX, L... NaN NaN
3      This will remove the Symantec Management Agent... NaN NaN
4             Are you sure you want to continue [Yy/Nn]? NaN NaN
5                    Uninstalling dependant solutions... NaN NaN
6      Unregistering the Altiris Base Task Handlers f... NaN NaN
7                Unregistering the Script Task Plugin... NaN NaN
8         Unregistering the Power Control Task Plugin... NaN NaN
9       Unregistering the Service Control Task Plugin... NaN NaN

Expected result: 预期结果:

myserer143
dbserer144

OR it its doable 或者它可行

myserer143 Uninstallation has finished
dbserer144 Uninstallation has finished

Use shift with startswith for boolean mask and filter by boolean indexing : 使用带startswith shift作为布尔掩码,并通过boolean indexing过滤:

data = pd.read_csv("alt_1.logs", sep='delimiter', names=["a"], engine="python")

m1 = data['a'].shift(-1).str.startswith('----', na=False)
m2 = data['a'].shift(-2).str.startswith('----', na=False)

Filtering rows and also added last row of DataFrame by append : 过滤行,并通过append DataFrame的最后一行:

data = data[m1 | m2].append(data.iloc[[-1]])
print (data)
                               a
0                     myserer143
44  Uninstallation has finished.
45                    dbserer144
89  Uninstallation has finished.

Reshape values and join text together: 重塑值并将文本连接在一起:

df = pd.DataFrame(data.values.reshape(-1,2)).apply(' '.join, 1).to_frame('data')
print (df)
                                      data
0  myserer143 Uninstallation has finished.
1  dbserer144 Uninstallation has finished.

EDIT: 编辑:

For better performace or working with large file is possible loop by each line to list, get values to list of dictionaries and create DataFrame. 为了获得更好的性能或使用大文件,可以通过每一行循环到列表,将值获取到字典列表并创建DataFrame。 Last shift and add last value: 最后移位并添加最后一个值:

data = pd.read_csv("alt_1.logs", sep='delimiter', names=["a"], engine="python")

L = []
with open('result.csv', 'r') as f:
    for line in f:
        line = line.strip()
        if line:
            L.append(line)
L = L[-1:] + L

out = [{'a':L[i-1], 'b':L[i-2]} for i, x in enumerate(L) if x.startswith('---') ]
print (out)
[{'a': 'myserer143', 'b': 'Uninstallation has finished.'}, 
 {'a': 'dbserer144', 'b': 'Uninstallation has finished.'}]

df = pd.DataFrame(out)
df['b'] = df['b'].shift(-1).fillna(df.loc[0,'b'])
df = df.apply(' '.join, 1).to_frame('data')
print (df)
                                      data
0  myserer143 Uninstallation has finished.
1  dbserer144 Uninstallation has finished.

Considering that there are many lines in the data you do not need, I think it would be better to prepare the data before loading it into the dataframe. 考虑到数据中不需要很多行,我认为在将数据加载到数据帧之前准备数据会更好。

Based on the file, the parts of information you need are always separated by a delimiter of '-------... , so it makes sense to look ahead in the generator for those lines and only load the 2 lines before the delimiter. 根据文件,您需要的信息部分始终由定界符'-------...分隔,因此有意义的是在生成器中向前查找这些行,并且仅在加载前两行分隔符。

We do that by taking the first 2 lines out as a start, and then looping through the file to get the information you need. 为此,我们将前两行作为开始,然后遍历文件以获取所需的信息。

from itertools import tee, islice, zip_longest

results = []

f = open('sample.txt','r')
n = 2 #number of lines to check
first = next(f)
delim = next(f)

results.append(first)
peek, lines = tee(f)

for idx, val in enumerate(lines):
    if val == delim:
        for val in islice(peek.__copy__(), idx - n, idx):
            results.append(val)
    last = idx

for i in islice(peek.__copy__(), last, last + 1):
    results.append(i)

results
>> ['myserer143\n',
 'Uninstallation has finished.\n',
 'dbserer144\n',
 'Uninstallation has finished.\n',
 'dbserer144\n',
 'Uninstallation has finished.']

At this point, there is no wastage of memory to load the unused lines and your returned list contains the information you need by setting the offsets for the first few lines and getting the last line. 此时,不浪费内存来加载未使用的行,并且通过设置前几行的偏移量并获取最后一行,返回的列表包含了所需的信息。


Then you can group the results in pairs to be loaded to the dataframe using the Python recipe from itertools . 然后,您可以将结果成对分组,使用itertools的Python配方将其加载到数据itertools

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

results = [i.strip() for i in results]
data = list(grouper(results, n))

df = pd.DataFrame(data, columns = ['Name','Status'])
df

>>
         Name                        Status
0  myserer143  Uninstallation has finished.
1  dbserer144  Uninstallation has finished.
2  dbserer144  Uninstallation has finished.

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM