
Python DataFrame Data Analysis of Large Amount of Data from a Text File

I have the following code:

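import pandas as pd

datadicts = []
with open("input.txt") as f:
    for line in f:
        parts = line.split()  # split on any run of whitespace
        if not parts:
            continue  # ignore blank lines
        # keep the 2nd, 4th, 6th, and 7th fields (matches the output shown below)
        datadicts.append({'col1': parts[1], 'col2': parts[3], 'col3': parts[5], 'col4': parts[6]})

df = pd.DataFrame(datadicts)
df = df.drop([0])
print(df)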

I am pulling data from an unformatted text file. When the file is opened, it looks something like this, only on a much larger scale:

00 2381    1.3 3.4 1.8 265879 Name 
34 7879    7.6 4.2 2.1 254789 Name 
45 65824   2.3 3.4 1.8 265879 Name 
58 3450    1.3 3.4 1.8 183713 Name 
69 37495   1.3 3.4 1.8 137632 Name 
73 458913  1.3 3.4 1.8 138024 Name 

Here are the things I'm having trouble doing with this data:

  1. I only need the second, fourth, sixth, and seventh columns of data. I believe my code above handles this by reading each line and building a dataframe from the necessary columns, but I am open to suggestions if anyone has a better way.
  2. I need to skip the first row of data. open() has no skiprows option, so I drop the first row afterwards, but then my index no longer starts at 0. Is there any way around this?
  3. I need the resulting dataframe to look clean. Right now it looks something like this:
Col1   Col2   Col3 Col4
2381    3.4 265879 Name 
7879    4.2 254789 Name 
65824   3.4 265879 Name 
3450    3.4 183713 Name 
37495   3.4 137632 Name 
458913  3.4 138024 Name 

Everything is right-aligned under the column and it looks strange. Any ideas how to solve this?

  4. I also need to perform statistical analysis on the columns and find the Name with the highest and lowest values. I keep getting errors, though, and I think it is because the values inside the dataframe are stored as objects rather than as integers, floats, or strings, even though the data is in a dataframe.

So, if my data cannot be analyzed with Python functions as it stands, does anyone know how I can fix it?

Any help would be greatly appreciated. I hope I've laid out all of my needs clearly. I am new to Python, and I'm not sure if I'm using all the proper terminology.

You can use the pandas.read_csv() function to accomplish this very easily.

  • txt2pd.txt is a text file containing a copy/paste from your source above
  • sep is using a regex pattern to delimit by one or more consecutive spaces
  • names uses a list to create your column names
  • skiprows skips the first row, per your requirements

Example:

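import pandas as pd

keep = ['col1', 'col3', 'col5', 'col6']
df = pd.read_csv('txt2pd.txt',
                 sep=r'\s+',  # raw string: one or more whitespace characters
                 names=['col0', 'col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
                 skiprows=1)
df = df[keep]  # keep only the columns of interest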

Output:

     col1  col3    col5  col6
0    7879   4.2  254789  Name
1   65824   3.4  265879  Name
2    3450   3.4  183713  Name
3   37495   3.4  137632  Name
4  458913   3.4  138024  Name
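
As a variation on the example above, read_csv can also select columns while reading through its usecols parameter, which avoids the separate filtering step. The sketch below is self-contained: it reads the sample rows from an in-memory string (an assumption made here so it runs without a file) and then prints the inferred dtypes, since numeric columns are what the analysis needs.

```python
import io
import pandas as pd

# Sample rows copied from the question; a real run would pass a file path instead.
raw = """00 2381    1.3 3.4 1.8 265879 Name
34 7879    7.6 4.2 2.1 254789 Name
45 65824   2.3 3.4 1.8 265879 Name
58 3450    1.3 3.4 1.8 183713 Name
69 37495   1.3 3.4 1.8 137632 Name
73 458913  1.3 3.4 1.8 138024 Name
"""

df = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s+",       # one or more whitespace characters
    header=None,      # the file has no header row
    names=["col0", "col1", "col2", "col3", "col4", "col5", "col6"],
    usecols=["col1", "col3", "col5", "col6"],  # read only these columns
    skiprows=1,       # skip the first row of data
)

print(df.dtypes)  # numeric columns are inferred as numeric dtypes
print(df)
```

Because the first row is skipped during parsing rather than dropped afterwards, the index starts at 0 without any further step.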

Sample Analysis:

Using df.describe() you can output a simple, high-level analysis. (Anything further should be the subject of a new question.)

                col1      col3           col5
count       5.000000  5.000000       5.000000
mean   114712.200000  3.560000  196007.400000
std    194048.545838  0.357771   61762.106621
min      3450.000000  3.400000  137632.000000
25%      7879.000000  3.400000  138024.000000
50%     37495.000000  3.400000  183713.000000
75%     65824.000000  3.400000  254789.000000
max    458913.000000  4.200000  265879.000000
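
For the question's last point, finding the name with the highest and lowest value, a minimal sketch using idxmax and idxmin follows. The distinct names (Alpha, Beta, and so on) are hypothetical placeholders, since the sample data repeats Name on every row. The pd.to_numeric step is only needed when the values arrive as strings, which is what happens when the frame is built by splitting lines by hand.

```python
import pandas as pd

# Hypothetical frame: values are strings, as produced by manual line splitting.
df = pd.DataFrame({
    "col1": ["7879", "65824", "3450", "37495", "458913"],
    "col6": ["Alpha", "Beta", "Gamma", "Delta", "Epsilon"],  # placeholder names
})

# Convert the object column to a numeric dtype; unparseable entries become NaN.
df["col1"] = pd.to_numeric(df["col1"], errors="coerce")

# idxmax/idxmin return the index label of the largest/smallest value.
highest = df.loc[df["col1"].idxmax(), "col6"]
lowest = df.loc[df["col1"].idxmin(), "col6"]

print("highest:", highest)
print("lowest:", lowest)
```

With read_csv the conversion is usually unnecessary, because numeric columns are inferred while parsing.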
