How to deal with a large amount of data in Python

I have a text file with a large amount of data (3 GB). Each line contains a time, a source IP, a destination IP and a size. The digits after the last dot of each address are the port number. I want to plot those port numbers as a histogram. I have done this for 10 000 lines, but, as I expected, the same code cannot handle the full file.

Briefly, the code below reads the 10 000 lines, splits them and puts everything in a list named everything_list (please ignore the condition of the while loop; it works). It then collects all the source ports in a list and draws their histogram. With a million or more lines I cannot even read the data in the first place, let alone categorize it. Some people told me to use arrays, and others told me to process one chunk of data at a time. I am confused by the conflicting advice. Can anybody help me with this?

text_file = open("test.data", "r")
a = text_file.read()
text_file.close()

everything_list = a.split()
source_port_list = []
i=0
while 6+7*i<len(everything_list):

    source_element = everything_list[2+7*i]
    source_port_position = source_element.rfind('.')
    source_port_number = int(source_element[source_port_position + 1:])
    source_port_list.append(source_port_number)

    i=i+1


import matplotlib.pyplot as plt
import pylab


numBins = 20
plt.hist(source_port_list, numBins, color='red', alpha=0.8)
plt.show()

This is the lines format:

15:42:42.719063 IP 129.241.138.133.47843 > 129.63.27.12.2674: tcp 1460
15:42:42.719205 IP 129.241.138.133.47843 > 129.63.27.12.2674: tcp 1460
15:42:42.719209 IP 129.63.57.175.45241 > 62.85.5.142.55455: tcp 0
15:42:42.719213 IP 24.34.41.8.1236 > 129.63.1.23.443: tcp 394
15:42:42.719217 IP 59.167.148.152.25918 > 129.63.57.40.36075: tcp 0
15:42:42.719260 IP 129.63.223.16.2823 > 80.67.87.25.80: tcp 682
15:42:42.719264 IP 129.63.184.118.2300 > 64.111.215.46.80: tcp 0
15:42:42.719269 IP 129.63.184.118.2300 > 64.111.215.46.80: tcp 0

I don't know what the data looks like, but I think the issue is that you try to hold it all in memory at once. You need to do it little by little: read the lines one by one and build the histogram as you go.

histogram = {}
with open(...) as f:
    for line in f:
        ip = ...
        if ip in histogram:
            histogram[ip] += 1
        else:
            histogram[ip] = 1

You can now plot the histogram, but use plt.plot not plt.hist since you already have the frequencies in the histogram dictionary.
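One way the elided parts could be filled in, as a minimal sketch rather than the definitive version: it assumes every line looks like the samples in the question (source address in the third whitespace-separated field, port after its last dot). The name `count_source_ports` is illustrative and does not come from the answer.

```python
from collections import Counter


def count_source_ports(lines):
    """Count source ports from an iterable of tcpdump-style lines.

    Assumption: the source address is the third whitespace-separated
    field and the port is whatever follows its last dot, as in the
    question's sample lines. Lines that do not fit are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or truncated line
        port_text = fields[2].rpartition(".")[2]
        if port_text.isdigit():
            counts[int(port_text)] += 1
    return counts
```

Passing an open file object works because iterating over it yields one line at a time, so memory use depends on the number of distinct ports, not on the file size: `with open("test.data") as f: histogram = count_source_ports(f)`. After that, `ports = sorted(histogram)` and `[histogram[p] for p in ports]` give the x and y values for plt.plot.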

You could use a regex, compiled once outside your loop, together with reading your file lazily, line by line.

import re
import matplotlib.pyplot as plt
import pylab

r = re.compile(r'(?<=\.)[0-9]{2,5}(?= \>)')
ports = []

for line in open("test.data", "r"):
    # convert to int so that plt.hist bins the ports numerically
    ports.append(int(r.search(line).group(0)))

# determines the number of lines you want to take into account
i = (len(ports) - 6) // 7

# keeps only the first i elements
ports = ports[0:i]

numBins = 20
plt.hist(ports, numBins, color='red', alpha=0.8)
plt.show()

This code takes into account the fact that you want only the first (n - 6) // 7 items, n being the number of lines in your source file. Adjust by +1 or -1 if it is off by one. Dropping the unwanted items at the end means your loop does not have to check a condition on each iteration.

EDIT:

You can combine several of the things above to get more concise and efficient code:

import re
import matplotlib.pyplot as plt
import pylab

r = re.compile(r'(?<=\.)[0-9]{2,5}(?= \>)')

ports = [int(r.search(line).group(0)) for line in open("test.data", "r")]
ports = ports[0:(len(ports) - 6) // 7]

numBins = 20
plt.hist(ports, numBins, color='red', alpha=0.8)
plt.show()

EDIT:

If you think your list of ports will be too large to fit in RAM (which I find unlikely), my advice would be to use a dict of ports:

ports = {}
for line in open("test.data", "r"):
    port = re.search(r, line).group(0)
    if not ports.get(port, False):
        ports[port] = 0
    ports[port] += 1

Which will give you something like:

>>> ports
{
    "8394": 182938,
    "8192": 839288,
    "1283": 9839
}

Note that in such a case, your call to plt.hist will have to be modified.
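For example, pre-counted ports can be summed into equal-width bins and drawn as bars. This is a sketch under stated assumptions, not the answerer's code: `bin_port_counts` and `plot_port_histogram` are hypothetical helpers, and the range 0 to 65535 is assumed because TCP and UDP port numbers are 16-bit.

```python
def bin_port_counts(port_counts, num_bins=20, max_port=65535):
    """Sum per-port counts into num_bins equal-width bins over 0..max_port.

    port_counts maps a port (int or numeric string, as in the dict above)
    to the number of times it was seen. Returns (left_edges, totals).
    """
    width = (max_port + 1) / num_bins
    totals = [0] * num_bins
    for port, count in port_counts.items():
        # clamp so that a port at the very top of the range lands in the last bin
        index = min(int(int(port) / width), num_bins - 1)
        totals[index] += count
    left_edges = [i * width for i in range(num_bins)]
    return left_edges, totals


def plot_port_histogram(port_counts, num_bins=20, max_port=65535):
    """Draw the binned counts as bars."""
    # imported here so that the binning above can be used without matplotlib
    import matplotlib.pyplot as plt

    left_edges, totals = bin_port_counts(port_counts, num_bins, max_port)
    bar_width = (max_port + 1) / num_bins
    plt.bar(left_edges, totals, width=bar_width, align="edge",
            color="red", alpha=0.8)
    plt.show()
```

Unlike plt.hist, which by default spreads its bins between the smallest and largest value it sees, this sketch fixes the bins over the whole port range, so the picture may differ slightly from the original 20-bin plot.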

You can use split and a defaultdict which will be more efficient:

from collections import defaultdict

d = defaultdict(int)
with open("a_file.txt") as f:
    for line in f:
        d[line.split()[2].rsplit(".", 1)[-1]] += 1
print(d)

defaultdict(<type 'int'>, {'1236': 1, '2300': 1, '47843': 2, '45241': 1, '25918': 1, '2823': 1})

It might also be worth checking out other plotting libraries, since matplotlib is not the most efficient:

pyqtgraph, guiqwt, gnuplot.py

Sounds like you should be iterating by line and using regex to find the port. Try something like this:

import re

ports = []
with open("path/to/your/text/file.txt", 'r') as infile:
    for line in infile:
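        # findall returns every port on the line (source first, then
        # destination), so extend keeps ports a flat list of strings
        ports.extend(re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\.(\d+)", line))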
        # that regex explained:
        # # re.compile(r"""
        # #     \d{1,3}\.       # 1-3 digits followed by a literal .
        # #     \d{1,3}\.       # 1-3 digits followed by a literal .
        # #     \d{1,3}\.       # 1-3 digits followed by a literal .
        # #     \d{1,3}\.       # 1-3 digits followed by a literal .
        # #     (               # BEGIN CAPTURING GROUP
        # #       \d+           #   1 or more digits
        # #     )               # END CAPTURING GROUP""", re.X)

This assumes your IP/port is formatted as you explain in your comment:

IP.IP.IP.IP.PORT
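If the source and destination ports need to be told apart, one pattern with two named groups can pull both from a line. A minimal sketch, assuming the `SRC.PORT > DST.PORT:` layout of the question's sample lines; `LINE_RE` and `parse_ports` are illustrative names, not part of the answer.

```python
import re

# Four dotted octets followed by ".PORT", once on each side of " > ".
LINE_RE = re.compile(
    r"(?:\d{1,3}\.){4}(?P<src>\d+) > (?:\d{1,3}\.){4}(?P<dst>\d+):"
)


def parse_ports(line):
    """Return (source_port, destination_port) as ints, or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return int(match.group("src")), int(match.group("dst"))
```

Returning None for lines that do not match lets the caller skip malformed lines instead of failing partway through a 3 GB file.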

I know this is not a direct answer to your question, but since you are new to Python, there is a nice Coursera course dealing with this very subject: "Programming for Everybody (Python)". It is free to take and won't use too much of your time. The course starts February 2, 2015. The textbook "Python for Informatics: Exploring Information" is also a free Creative Commons download at http://www.pythonlearn.com/book.php. I hope this helps.
