Python DataFrame Data Analysis of Large Amount of Data from a Text File
I have the following code:
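import pandas as pd

datadicts = []
with open("input.txt") as f:
    for line in f:
        # line[':'] is kept as posted; the slice bounds are not shown in the post
        datadicts.append({'col1': line[':'], 'col2': line[':'], 'col3': line[':'], 'col4': line[':']})
df = pd.DataFrame(datadicts)
df = df.drop([0])  # drop the first row
print(df)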
I am pulling chunks of data from an unformatted text file. When the text file is opened, it looks something like this, only on a much larger scale:
00 2381 1.3 3.4 1.8 265879 Name
34 7879 7.6 4.2 2.1 254789 Name
45 65824 2.3 3.4 1.8 265879 Name
58 3450 1.3 3.4 1.8 183713 Name
69 37495 1.3 3.4 1.8 137632 Name
73 458913 1.3 3.4 1.8 138024 Name
Here are the things I'm having trouble doing with this data:
Col1 Col2 Col3 Col4
2381 3.4 265879 Name
7879 4.2 254789 Name
65824 3.4 265879 Name
3450 3.4 183713 Name
37495 3.4 137632 Name
458913 3.4 138024 Name
Everything is right-aligned under each column, which looks strange. Any ideas how to solve this?
So, if my data cannot be analyzed using Python functions, does anyone know how I can fix it so that the analysis runs correctly?
Any help would be greatly appreciated. I hope I've laid out all of my needs clearly. I am new to Python, and I'm not sure if I'm using all the proper terminology.
You can use the pandas.read_csv() function to accomplish this very easily.
txt2pd.txt is a text file containing a copy/paste of your source above.
sep uses a regex pattern to split on one or more consecutive whitespace characters.
names uses a list to create your column names.
skiprows skips the first row, per your requirements.
import pandas as pd

keep = ['col1', 'col3', 'col5', 'col6']
df = pd.read_csv('txt2pd.txt',
                 sep=r'\s+',  # raw string, so the backslash reaches the regex engine unchanged
                 names=['col0', 'col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
                 skiprows=1)
df = df[keep]  # keep only the wanted columns
     col1  col3    col5  col6
0    7879   4.2  254789  Name
1   65824   3.4  265879  Name
2    3450   3.4  183713  Name
3   37495   3.4  137632  Name
4  458913   3.4  138024  Name
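The printed frame is still right-aligned, which is one of the things the question asks about. One possible way to get left-aligned output is to pad every cell to a common width before printing. This is only a sketch: the helper name left_align is made up for illustration, and the frame is rebuilt inline from the rows shown here rather than read from txt2pd.txt.

```python
import pandas as pd


def left_align(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose cells are strings padded on the right to equal width.

    Hypothetical helper for display only; the result is no longer numeric.
    """
    padded = {}
    for name in frame.columns:
        text = frame[name].astype(str)
        # Width is the longest cell or the header, whichever is larger.
        width = max(int(text.str.len().max()), len(str(name)))
        padded[name] = text.str.ljust(width)
    return pd.DataFrame(padded, index=frame.index)


# Rows copied from the output shown in this answer.
df = pd.DataFrame({
    "col1": [7879, 65824, 3450, 37495, 458913],
    "col3": [4.2, 3.4, 3.4, 3.4, 3.4],
    "col5": [254789, 265879, 183713, 137632, 138024],
    "col6": ["Name"] * 5,
})

aligned = left_align(df)
# justify="left" left-aligns the column headers as well.
print(aligned.to_string(index=False, justify="left"))
```

Because the padded frame holds strings, it is best kept as a separate display copy, with the original df used for any calculations.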
Using df.describe() you can output a simple, high-level analysis. (Anything further should be the subject of a new question.)
                col1      col3           col5
count       5.000000  5.000000       5.000000
mean   114712.200000  3.560000  196007.400000
std    194048.545838  0.357771   61762.106621
min      3450.000000  3.400000  137632.000000
25%      7879.000000  3.400000  138024.000000
50%     37495.000000  3.400000  183713.000000
75%     65824.000000  3.400000  254789.000000
max    458913.000000  4.200000  265879.000000
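The steps in this answer can also be tried end to end without creating a file. The sketch below makes two assumptions that are not part of the original answer: the sample rows are fed in through io.StringIO instead of txt2pd.txt, and usecols is used so that only the wanted columns are parsed, which may help on much larger files. It also shows why col6 is missing from the summary above: describe() covers numeric columns unless include="all" is passed.

```python
import io

import pandas as pd

# Sample rows copied from the question; a real run would pass the file path instead.
SAMPLE = """\
00 2381 1.3 3.4 1.8 265879 Name
34 7879 7.6 4.2 2.1 254789 Name
45 65824 2.3 3.4 1.8 265879 Name
58 3450 1.3 3.4 1.8 183713 Name
69 37495 1.3 3.4 1.8 137632 Name
73 458913 1.3 3.4 1.8 138024 Name
"""

names = ["col0", "col1", "col2", "col3", "col4", "col5", "col6"]
keep = ["col1", "col3", "col5", "col6"]

df = pd.read_csv(
    io.StringIO(SAMPLE),
    sep=r"\s+",      # one or more whitespace characters
    names=names,
    usecols=keep,    # parse only the wanted columns
    skiprows=1,      # skip the first row, as in the answer
)

numeric_summary = df.describe()             # numeric columns only
full_summary = df.describe(include="all")   # also summarizes the text column col6

print(numeric_summary)
```

Selecting columns with usecols replaces the separate df = df[keep] step. Either approach should give the same frame for this sample.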