
Python - Transpose columns to rows within data operation and before writing to file

I have developed a public and open source App for Splunk (Nmon performance monitor for Unix and Linux systems, see https://apps.splunk.com/app/1753/ ).

A key piece of the App is an old Perl script (recycled, modified and updated), launched automatically by the App, that converts the Nmon data (a kind of custom CSV). It reads from stdin and writes out properly formatted CSV files, one per section (a section is a performance monitor).

I now want to fully rewrite this script in Python, and a first beta version is almost done... BUT I am facing difficulties transposing the data, and I'm afraid I won't be able to solve it myself.

This is why I am kindly asking for help today.

Here is the difficulty in detail:

Nmon generates performance monitors for various sections (cpu, memory, disks...). For many of them there is no big difficulty beyond extracting the right timestamp and so on. But every section that has a "device" notion (such as DISKBUSY in the example below, which represents the percentage of time the disks were busy) has to be transformed and transposed to be exploitable later.

Currently, I am able to generate the data as follows:

Example:

time,sda,sda1,sda2,sda3,sda5,sda6,sda7,sdb,sdb1,sdc,sdc1,sdc2,sdc3
26-JUL-2014 11:10:44,4.4,0.0,0.0,0.0,0.4,1.9,2.5,0.0,0.0,10.2,10.2,0.0,0.0
26-JUL-2014 11:10:54,4.8,0.0,0.0,0.0,0.3,2.0,2.6,0.0,0.0,5.4,5.4,0.0,0.0
26-JUL-2014 11:11:04,4.8,0.0,0.0,0.0,0.4,2.3,2.1,0.0,0.0,17.8,17.8,0.0,0.0
26-JUL-2014 11:11:14,2.1,0.0,0.0,0.0,0.2,0.5,1.5,0.0,0.0,28.2,28.2,0.0,0.0

The goal is to transpose the data so that the header becomes "time,device,value", for example:

time,device,value
26-JUL-2014 11:10:44,sda,4.4
26-JUL-2014 11:10:44,sda1,0.0
26-JUL-2014 11:10:44,sda2,0.0

And so on.
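For illustration, here is a minimal sketch of that transposition using `zip` to pair each device column with its value (the sample data is inlined and truncated; this is not the app's actual code):

```python
import csv

# Inlined sample resembling the DISKBUSY output above (truncated)
SAMPLE = """time,sda,sda1,sda2
26-JUL-2014 11:10:44,4.4,0.0,0.0
26-JUL-2014 11:10:54,4.8,0.0,0.0
"""

rows = list(csv.reader(SAMPLE.splitlines()))
header, body = rows[0], rows[1:]

# Pair every device column of each row with its value
transposed = [["time", "device", "value"]]
for row in body:
    time = row[0]
    for device, value in zip(header[1:], row[1:]):
        transposed.append([time, device, value])

for out_row in transposed:
    print(",".join(out_row))
```

Each source row of N device columns becomes N output rows sharing the same timestamp.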

One month ago, I opened a question for almost the same need (for another app and not exactly the same data, but the same need to transpose columns to rows):

Python - CSV time oriented Transposing large number of columns to rows

I got a great answer which perfectly did the trick, but I am unable to recycle that piece of code into this new context. One difference is that I want to include the data transposition inside the code itself, so that the script works only in memory and avoids dealing with multiple temporary files.

Here is the piece of code:

Note: this needs to run on Python 2.x.

###################
# Dynamic Sections : data requires to be transposed to be exploitable within Splunk
###################


dynamic_section = ["DISKBUSY"]

for section in dynamic_section:

    # Set output file
    currsection_output = DATA_DIR + HOSTNAME + '_' + day + '_' + month + '_' + year + '_' + hour + minute + second + '_' + section + '.csv'

    # Open output for writing
    with open(currsection_output, "w") as currsection:

        for line in data:

            # Extract sections, and write to output
            myregex = r'^' + section + r'[0-9]*|ZZZZ.+'
            find_section = re.match(myregex, line)
            if find_section:

                # csv header

                # Replace some symbols
                line = re.sub("%", '_PCT', line)
                line = re.sub(" ", '_', line)

                # Extract header, excluding data lines which always have Txxxx as timestamp reference
                myregex = r'(' + section + r')\,([^T].+)'
                fullheader_match = re.search(myregex, line)

                if fullheader_match:
                    fullheader = fullheader_match.group(2)

                    header_match = re.match(r'([a-zA-Z\-\/\_0-9]+,)([a-zA-Z\-\/\_0-9\,]*)', fullheader)

                    if header_match:
                        header = header_match.group(2)

                        # Write header
                        currsection.write('time' + ',' + header + '\n')

                # Extract timestamp

                # Nmon V9 and prior do not have the date in ZZZZ
                # If unavailable, we'll use the global date (AAA,date)
                ZZZZ_DATE = '-1'
                ZZZZ_TIME = '-1'

                # For Nmon V10 and later
                timestamp_match = re.match(r'^ZZZZ\,(.+)\,(.+)\,(.+)\n', line)
                if timestamp_match:
                    ZZZZ_TIME = timestamp_match.group(2)
                    ZZZZ_DATE = timestamp_match.group(3)
                    ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # For Nmon V9 and earlier
                if ZZZZ_DATE == '-1':
                    ZZZZ_DATE = DATE
                    timestamp_match = re.match(r'^ZZZZ\,(.+)\,(.+)\n', line)
                    if timestamp_match:
                        ZZZZ_TIME = timestamp_match.group(2)
                        ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # Extract data
                myregex = r'^' + section + r'\,(T\d+)\,(.+)\n'
                perfdata_match = re.match(myregex, line)
                if perfdata_match:
                    perfdata = perfdata_match.group(2)

                    # Write perf data
                    currsection.write(ZZZZ_timestamp + ',' + perfdata + '\n')

        # End for

    # Open output for reading and show the number of lines we extracted
    with open(currsection_output, "r") as currsection:

        num_lines = sum(1 for line in currsection)
        print(section + " section: Wrote " + str(num_lines) + " lines")

# End for

The line:

                currsection.write('time' + ',' + header + '\n')

will contain the header.

And the line:

            currsection.write(ZZZZ_timestamp + ',' + perfdata + '\n')

contains the data, line by line.

Note: the final data (header and body) should, in the target, also contain other information; to simplify things I removed it from the code above.

For static sections, which do not require the data transposition, the same lines would be:

                    currsection.write('type' + ',' + 'serialnum' + ',' + 'hostname' + ',' + 'time' + ',' + header + '\n')

And:

                currsection.write(section + ',' + SN + ',' + HOSTNAME + ',' + ZZZZ_timestamp + ',' + perfdata + '\n')

The main goal would be to transpose the data just after it has been extracted and before it is written out.

Also, performance and minimal consumption of system resources (for example working with temporary files instead of memory) are requirements, to prevent the script from generating too high a CPU load on the systems that run it periodically.

Could anyone help me achieve this? I've looked for a solution again and again; I'm pretty sure there are multiple ways to achieve it (zip, map, dictionary, list, split...), but I failed to get it working...

Please be indulgent, this is my first real Python script :-)

Thank you very much for any help!

More details:

  • Testing nmon file

A small testing nmon file can be retrieved here: http://pastebin.com/xHLRbBU0

  • Current complete script

The current complete script can be retrieved here: http://pastebin.com/QEnXj6Yh

To test the script, it is required to:

  • export the SPLUNK_HOME variable to anything relevant for you, e.g.:

    mkdir /tmp/nmon2csv

--> place the script and the nmon file here, and make the script executable

export SPLUNK_HOME=/tmp/nmon2csv
mkdir -p etc/apps/nmon

And finally:

cat test.nmon | ./nmon2csv.py

Data will be generated in /tmp/nmon2csv/etc/apps/nmon/var/*

Update: working code using the csv module:

###################
# Dynamic Sections : data requires to be transposed to be exploitable within Splunk
###################

dynamic_section = ["DISKBUSY","DISKBSIZE","DISKREAD","DISKWRITE","DISKXFER","DISKRIO","DISKWRIO","IOADAPT","NETERROR","NET","NETPACKET","JFSFILE","JFSINODE"]

for section in dynamic_section:

    # Set output file (will be opened after the transpose)
    currsection_output = DATA_DIR + HOSTNAME + '_' + day + '_' + month + '_' + year + '_' + hour + minute + second + '_' + section + '.csv'

    # Open a temporary file for the raw (untransposed) section data
    with TemporaryFile() as tempf:

        for line in data:

            # Extract sections, and write to output
            myregex = r'^' + section + r'[0-9]*|ZZZZ.+'
            find_section = re.match(myregex, line)
            if find_section:

                # csv header

                # Replace some symbols
                line = re.sub("%", '_PCT', line)
                line = re.sub(" ", '_', line)

                # Extract header, excluding data lines which always have Txxxx as timestamp reference
                myregex = r'(' + section + r')\,([^T].+)'
                fullheader_match = re.search(myregex, line)

                if fullheader_match:
                    fullheader = fullheader_match.group(2)

                    header_match = re.match(r'([a-zA-Z\-\/\_0-9]+,)([a-zA-Z\-\/\_0-9\,]*)', fullheader)

                    if header_match:
                        header = header_match.group(2)

                        # Write header
                        tempf.write('time' + ',' + header + '\n')

                # Extract timestamp

                # Nmon V9 and prior do not have the date in ZZZZ
                # If unavailable, we'll use the global date (AAA,date)
                ZZZZ_DATE = '-1'
                ZZZZ_TIME = '-1'

                # For Nmon V10 and later
                timestamp_match = re.match(r'^ZZZZ\,(.+)\,(.+)\,(.+)\n', line)
                if timestamp_match:
                    ZZZZ_TIME = timestamp_match.group(2)
                    ZZZZ_DATE = timestamp_match.group(3)
                    ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # For Nmon V9 and earlier
                if ZZZZ_DATE == '-1':
                    ZZZZ_DATE = DATE
                    timestamp_match = re.match(r'^ZZZZ\,(.+)\,(.+)\n', line)
                    if timestamp_match:
                        ZZZZ_TIME = timestamp_match.group(2)
                        ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # Extract data
                myregex = r'^' + section + r'\,(T\d+)\,(.+)\n'
                perfdata_match = re.match(myregex, line)
                if perfdata_match:
                    perfdata = perfdata_match.group(2)

                    # Write perf data
                    tempf.write(ZZZZ_timestamp + ',' + perfdata + '\n')

        # Open the final output for writing
        with open(currsection_output, "w") as currsection:

            # Rewind the temporary file before reading it back
            tempf.seek(0)

            writer = csv.writer(currsection)
            writer.writerow(['type', 'serialnum', 'hostname', 'time', 'device', 'value'])

            # Transpose: one output row per (time, device) pair
            for d in csv.DictReader(tempf):
                time = d.pop('time')
                for device, value in sorted(d.items()):
                    row = [section, SN, HOSTNAME, time, device, value]
                    writer.writerow(row)

            # End for

    # Open output for reading and show the number of lines we extracted
    with open(currsection_output, "r") as currsection:

        num_lines = sum(1 for line in currsection)
        print(section + " section: Wrote " + str(num_lines) + " lines")

# End for
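If avoiding the temporary file entirely is preferred, one possible variant (a sketch only, not the app's actual code; `SECTION`, `SN` and `HOSTNAME` stand in for the script's real variables) buffers the extracted section in memory with `StringIO` instead of `TemporaryFile`:

```python
import csv
from io import StringIO  # in Python 2, use: from StringIO import StringIO

# Stand-in for one extracted section (normally built line by line by the script)
section_buf = StringIO()
section_buf.write("time,sda,sdb\n")
section_buf.write("26-JUL-2014 11:10:44,4.4,0.0\n")
section_buf.write("26-JUL-2014 11:10:54,4.8,0.1\n")
section_buf.seek(0)  # rewind before reading back, as with the temporary file

SECTION, SN, HOSTNAME = "DISKBUSY", "serial123", "host1"  # hypothetical values

out = StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["type", "serialnum", "hostname", "time", "device", "value"])

# Same transpose logic as above: one output row per (time, device) pair
for d in csv.DictReader(section_buf):
    time = d.pop("time")
    for device, value in sorted(d.items()):
        writer.writerow([SECTION, SN, HOSTNAME, time, device, value])

result = out.getvalue()
print(result)
```

The trade-off is memory: the buffer holds the whole section, which is why the original keeps `TemporaryFile` to limit resident memory on busy systems.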

The goal is to transpose the data so that the header becomes "time,device,value".

The rough transposition logic looks like this:

text = '''time,sda,sda1,sda2,sda3,sda5,sda6,sda7,sdb,sdb1,sdc,sdc1,sdc2,sdc3
26-JUL-2014 11:10:44,4.4,0.0,0.0,0.0,0.4,1.9,2.5,0.0,0.0,10.2,10.2,0.0,0.0
26-JUL-2014 11:10:54,4.8,0.0,0.0,0.0,0.3,2.0,2.6,0.0,0.0,5.4,5.4,0.0,0.0
26-JUL-2014 11:11:04,4.8,0.0,0.0,0.0,0.4,2.3,2.1,0.0,0.0,17.8,17.8,0.0,0.0
26-JUL-2014 11:11:14,2.1,0.0,0.0,0.0,0.2,0.5,1.5,0.0,0.0,28.2,28.2,0.0,0.0
'''

import csv

for d in csv.DictReader(text.splitlines()):
    time = d.pop('time')
    for device, value in sorted(d.items()):
        print time, device, value

Putting it all together into a complete script looks something like this:

import csv

with open('transposed.csv', 'wb') as destfile:
    writer = csv.writer(destfile)
    writer.writerow(['time', 'device', 'value'])
    with open('data.csv', 'rb') as sourcefile:
        for d in csv.DictReader(sourcefile):
            time = d.pop('time')
            for device, value in sorted(d.items()):
                row = [time, device, value]
                writer.writerow(row)
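The per-row logic above can also be factored into a reusable generator (a sketch; `transpose_rows` is a name introduced here, not part of the original script). Because `csv.DictReader` streams one source row at a time, memory use stays constant regardless of file size:

```python
import csv

def transpose_rows(reader):
    """Yield (time, device, value) tuples from a csv.DictReader,
    one source row at a time."""
    for d in reader:
        time = d.pop('time')
        for device, value in sorted(d.items()):
            yield time, device, value

# Usage with an in-memory sample standing in for data.csv
sample = [
    "time,sda,sdb",
    "26-JUL-2014 11:10:44,4.4,0.0",
]
rows = list(transpose_rows(csv.DictReader(sample)))
print(rows)
```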
