
Python DataFrame Data Analysis of Large Amount of Data from a Text File

I have the following code:

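import pandas as pd

datadicts = []
with open("input.txt") as f:
    for line in f:
        fields = line.split()  # split on any run of whitespace
        # keep the 2nd, 4th, 6th and 7th fields (values are still strings here)
        datadicts.append({'col1': fields[1], 'col2': fields[3], 'col3': fields[5], 'col4': fields[6]})

df = pd.DataFrame(datadicts)
df = df.drop([0])  # drop the first row; the index then starts at 1
print(df)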

I am using a text file (that is not formatted) to pull chunks of data from. When the text file is opened, it looks something like this, except on a much larger scale:

00 2381    1.3 3.4 1.8 265879 Name 
34 7879    7.6 4.2 2.1 254789 Name 
45 65824   2.3 3.4 1.8 265879 Name 
58 3450    1.3 3.4 1.8 183713 Name 
69 37495   1.3 3.4 1.8 137632 Name 
73 458913  1.3 3.4 1.8 138024 Name 

Here are the things I'm having trouble doing with this data:

  1. I only need the second, fourth, sixth, and seventh columns of data. I believe I've solved this one with my code above by reading the individual lines and creating a dataframe with the necessary columns. I am open to suggestions if anyone has a better way of doing this.
  2. I need to skip the first row of data. The open function doesn't have a skiprows parameter, so when I drop the first row, I also lose my index starting at 0. Is there any way around this?
  3. I need the resulting dataframe to look like a nice clean dataframe. As of right now, it looks something like this:
Col1   Col2   Col3 Col4
2381    3.4 265879 Name 
7879    4.2 254789 Name 
65824   3.4 265879 Name 
3450    3.4 183713 Name 
37495   3.4 137632 Name 
458913  3.4 138024 Name 

Everything is right-aligned under the column and it looks strange. Any ideas how to solve this?

  4. I also need to be able to perform statistical analysis on the columns of data, and to find the Name with the highest value and the lowest value. For some reason I always get errors. I think that, even though I've got all the data set up as a dataframe, the values inside the dataframe are being read as objects instead of integers, strings, floats, etc.

So, if my data is not analyzable using Python functions, does anyone know how I can fix this so that the analysis runs correctly?

Any help would be greatly appreciated. I hope I've laid out all of my needs clearly. I am new to Python, and I'm not sure if I'm using all the proper terminology.

You can use the pandas.read_csv() function to accomplish this very easily.

  • txt2pd.txt is a text file containing a copy/paste of your source data above
  • sep uses a regex pattern to split on one or more consecutive whitespace characters
  • names uses a list to create your column names
  • skiprows skips the first row, per your requirements

Example:

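import pandas as pd

# Columns to keep: the 2nd, 4th, 6th and 7th fields of each line.
keep = ['col1', 'col3', 'col5', 'col6']
df = pd.read_csv('txt2pd.txt',
                 sep=r'\s+',   # raw string: one or more whitespace characters
                 names=['col0', 'col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
                 skiprows=1)   # skip the first line of the file
df = df[keep]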

Output:

     col1  col3    col5  col6
0    7879   4.2  254789  Name
1   65824   3.4  265879  Name
2    3450   3.4  183713  Name
3   37495   3.4  137632  Name
4  458913   3.4  138024  Name
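
As a variation on the example above, the unwanted columns can be skipped while parsing by passing usecols, and df.dtypes can then be printed to confirm that the numeric columns are no longer read as objects. This is only a sketch: io.StringIO stands in for the real file, and the sample lines are copied from the question.

```python
import io
import pandas as pd

# Sample lines from the question; StringIO stands in for the real text file.
raw = io.StringIO(
    "00 2381    1.3 3.4 1.8 265879 Name\n"
    "34 7879    7.6 4.2 2.1 254789 Name\n"
    "45 65824   2.3 3.4 1.8 265879 Name\n"
    "58 3450    1.3 3.4 1.8 183713 Name\n"
    "69 37495   1.3 3.4 1.8 137632 Name\n"
    "73 458913  1.3 3.4 1.8 138024 Name\n"
)

all_names = ["col0", "col1", "col2", "col3", "col4", "col5", "col6"]

df = pd.read_csv(
    raw,
    sep=r"\s+",                               # split on runs of whitespace
    names=all_names,                          # the file has no header row
    usecols=["col1", "col3", "col5", "col6"], # parse only the needed columns
    skiprows=1,                               # skip the first line
)

# Inspect the inferred type of each column.
print(df.dtypes)
print(df)
```

Because the first line is skipped during parsing, the index still starts at 0. If you stay with the manual loop from the question instead, df.drop([0]).reset_index(drop=True) renumbers the index after the drop.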

Sample Analysis:

Using df.describe() you can output a simple, high-level analysis. (Anything further should be the subject of a new question.)

                col1      col3           col5
count       5.000000  5.000000       5.000000
mean   114712.200000  3.560000  196007.400000
std    194048.545838  0.357771   61762.106621
min      3450.000000  3.400000  137632.000000
25%      7879.000000  3.400000  138024.000000
50%     37495.000000  3.400000  183713.000000
75%     65824.000000  3.400000  254789.000000
max    458913.000000  4.200000  265879.000000
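
For the highest and lowest Name mentioned in the question, one possible approach is sketched below. It assumes the numbers arrived as strings (as a manual line-by-line loop would produce), converts them with pd.to_numeric, and then looks up the name on the rows returned by idxmax and idxmin. The three rows and the names "A", "B", "C" are made up for illustration and are not from the real file.

```python
import pandas as pd

# Hypothetical rows: numbers stored as text, names are placeholders.
df = pd.DataFrame({
    "col1": ["7879", "65824", "3450"],
    "col5": ["254789", "265879", "183713"],
    "col6": ["A", "B", "C"],
})

# Convert the text columns to numbers; errors="coerce" turns any
# value that cannot be parsed into NaN instead of raising an error.
for col in ["col1", "col5"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# idxmax/idxmin return the index label of the largest/smallest value,
# which df.loc then uses to fetch the matching name.
highest = df.loc[df["col1"].idxmax(), "col6"]
lowest = df.loc[df["col1"].idxmin(), "col6"]

print("highest col1:", highest)
print("lowest col1:", lowest)
```

The same lookup works on any numeric column; only the column name passed to idxmax and idxmin changes.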
