output of disk usage with regex output

Question

The command in the code below (the one assigned to p) produces this output:

servername300 | SUCCESS | rc=0 >>
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       3.7T  1.3T  2.4T  16% /data/disk1
/dev/sdj1       3.7T  1.3T  2.4T  36% /data/disk10
/dev/sdk1       3.7T  1.3T  2.4T  36% /data/disk11
servername290 | SUCCESS | rc=0 >>
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       3.7T  1.4T  2.4T  36% /data/disk1
/dev/sdj1       3.7T  1.4T  2.4T  37% /data/disk10
/dev/sdk1       3.7T  1.4T  2.4T  37% /data/disk11

This code takes this output and figures out if the disks are more than 5% imbalanced.

import re
import subprocess

# universal_newlines=True makes check_output return text rather than bytes on Python 3
p = subprocess.check_output(["ansible","dev-data","-m","shell","-a","df -h /data/disk*"],
                            stdin=subprocess.PIPE, universal_newlines=True)

regex = re.compile(r'\d{1,2}%')

# strip the trailing "%" and convert each match to an integer
result = [int(a[:-1]) for a in regex.findall(p)]

print(min(result))
print(max(result))

if max(result) - min(result) > 5:
    print("Imbalanced!")
else:
    print("Balanced!")

This gives me the output:

16
37
Imbalanced!

What I want is to include the server and the disk in my output, for example: servername300 16 Disk 1, servername290 37 Disk 10.

Here is what I've tried so far

regex1 = re.compile(r'^servername\d* | \d{1,2}% |disk\d*')

result = [a for a in regex.findall(p)]

I need help including both the strings and the integers in my output somehow. Thank you.

Answer 1

I'd personally parse this as a table instead of doing complex regex, but if you insist, the pattern would be:

^(\S+).*?\n.*?\n(?:^.*?(\d+)%\s+(.*?)(?:\s|$))+

which would give you a match for each server entry and, under each of those, groups like: ["servername300", "16", "/data/disk1", "36", "/data/disk10", etc.]

However, Python's regex engine (like most regex engines, for that matter) overwrites repeated capture groups and keeps only the last capture, so running the above on your sample data would only yield [["servername300", "36", "/data/disk11"], ["servername290", "37", "/data/disk11"]]. Sure, you can construct the pattern to include optional captures for as many disks as you might expect, or play with range attributes for the repeated groups until you exhaust them all, but that quickly turns very ugly.
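To see the repeated-capture behaviour in isolation, here is a minimal sketch. The sample string and the patterns are made up for illustration and are not taken from the df output; the point is only that a capture group inside a repeated group keeps its last iteration, while running a simpler pattern repeatedly with findall keeps every occurrence.

```python
import re

# Hypothetical sample string, invented purely for illustration.
sample = "a1 b2 c3"

# Two capture groups sit inside a repeated non-capturing group.
repeated = re.compile(r"(?:(\w)(\d)\s?)+")
match = repeated.match(sample)

# Only the captures from the final iteration survive.
last_only = match.groups()

# Applying a simpler pattern repeatedly keeps every pair.
all_pairs = re.findall(r"(\w)(\d)", sample)

print(last_only)
print(all_pairs)
```

This is the same reason a single pattern with a repeated disk group cannot return every disk of a server in one match.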

You can split the job - one pattern to isolate the server entry, another to extract the disk usage data, and then just find the minimum and maximum to record the server and disk data, like:

import re
import subprocess

p = subprocess.check_output(["ansible","dev-data","-m","shell","-a","df -h /data/disk*"],
                            stdin=subprocess.PIPE,
                            universal_newlines=True)  # return text, not bytes, on Python 3

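# SERVER_MATCH: group 1 is the server name; the rest of that line and the df header
# line are skipped; group 2 is the block of consecutive lines starting with "/"
SERVER_MATCH = re.compile(r"(?!</)(\S+)(?:.*?\n){2}((?:^/.*(?:\n|$))+)", re.MULTILINE)
# DISK_MATCH: usage percentage and mount point of a single disk line
DISK_MATCH = re.compile(r"(\d+)%\s+(\S+)")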

min_use = 100, None, None  # store for minimum usage entry
max_use = 0, None, None    # store for maximum usage entry
for server in SERVER_MATCH.findall(p):  # get all the servers + their disks
    for disk in DISK_MATCH.findall(server[1]):  # get each disk (usage + mount point)
        disk_usage = int(disk[0])  # convert the usage to integer
        if disk_usage < min_use[0]:  # if current disk's usage is smaller than the stored minimum
            min_use = (disk_usage, server[0], disk[1])  # replace entry
        if disk_usage > max_use[0]:  # if current disk's usage is larger than the stored maximum
            max_use = (disk_usage, server[0], disk[1])  # replace entry

if max_use[0] - min_use[0] > 5:  # check min/max entries for larger discrepancy than 5...
    print("Imbalanced!")  # Imbalanced, print the info:
    print("\tMIN: {1} {0}% {2}".format(*min_use)) 
    print("\tMAX: {1} {0}% {2}".format(*max_use))
else:
    print("Balanced!")

That should give you a nice:

Imbalanced!
    MIN: servername300 16% /data/disk1
    MAX: servername290 37% /data/disk10

output for your sample data. That's enough for finding the imbalanced range, but if you want to keep all the records for a more thorough analysis than just min/max, you can use:

servers = {}
for server in SERVER_MATCH.findall(p):
    servers[server[0]] = [(int(disk[0]), disk[1]) for disk in DISK_MATCH.findall(server[1])]

which will give you a dict with server names for keys and a list of (usage, disk_name) pairs for each of their disks.
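For comparison, the "parse it as a table" route mentioned at the start of this answer could look roughly like the sketch below. It is only an illustration under a few assumptions: the sample text is copied from the question, a server header is recognised by containing "|" and ending in ">>" (as in that sample), mount points contain no spaces, and the helper name parse_df_table is hypothetical. It builds the same kind of dict and then computes a per-server usage spread.

```python
# Sample text copied from the question's output, used here for illustration only.
SAMPLE = """\
servername300 | SUCCESS | rc=0 >>
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       3.7T  1.3T  2.4T  16% /data/disk1
/dev/sdj1       3.7T  1.3T  2.4T  36% /data/disk10
/dev/sdk1       3.7T  1.3T  2.4T  36% /data/disk11
servername290 | SUCCESS | rc=0 >>
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       3.7T  1.4T  2.4T  36% /data/disk1
/dev/sdj1       3.7T  1.4T  2.4T  37% /data/disk10
/dev/sdk1       3.7T  1.4T  2.4T  37% /data/disk11
"""


def parse_df_table(text):
    """Hypothetical helper: return {server: [(usage_percent, mount_point), ...]}."""
    servers = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines between hosts
        if "|" in fields and fields[-1] == ">>":
            # Assumed header shape: "name | SUCCESS | rc=0 >>"
            current = fields[0]
            servers[current] = []
        elif fields[0].startswith("/") and current is not None and len(fields) >= 6:
            # df columns: Filesystem, Size, Used, Avail, Use%, Mounted on
            usage, mount = fields[4], fields[5]
            servers[current].append((int(usage.rstrip("%")), mount))
    return servers


servers = parse_df_table(SAMPLE)

# Difference between the fullest and emptiest disk on each server.
spread = {
    name: max(u for u, _ in disks) - min(u for u, _ in disks)
    for name, disks in servers.items()
    if disks
}

for name in sorted(spread):
    print("{0}: spread {1}%".format(name, spread[name]))
```

Splitting on whitespace avoids the regex entirely, at the cost of depending on the column order that df prints.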

solution1
0 ACCEPTED 2017-06-13 22:21:21
