I receive text files that are structured like this:
random
----new data-----
06/19/2018 13:57:39.99 random information here
06/19/2018 13:58:24.99 some more random info
06/19/2018 13:58:35.08 00:00:04.38 A 00000 0 765 228270257 A0 44 45
06/19/2018 13:58:39.99 00:00:00.00 A 00000 0 756 228270257 A0 4 5
06/19/2018 13:58:40.61 00:00:00.00 A 00000 0 828 228270257 A0 1 7
06/19/2018 13:57:39.99 random information here
06/19/2018 13:58:24.99 some more random info
---end data---
random stuff
There are several lines with random information surrounding the actual data I care about. I only want to keep the rows that have A in the fourth column, and then I want to turn the data into a CSV file.
Assuming the data above is in play.txt, I have tried several variants of this, which isn't working:
import csv
import pandas as pd
from io import StringIO
id = []
with open('play.txt', 'r') as fi:
    for ln in fi:
        if ln.startswith("A",4):
            id.append(ln[0:])
id2 = ' '.join(id)
df = pd.read_table(StringIO(id2), delimiter=r'\s+', header=None)
print(df)
df.to_csv('out.csv')
How can this be done in Python?
Use the following:
with open('play.txt', 'r') as fi:
    for line in fi:
        line = line.split(" ")
        # you can also use line.split() to split
        # the line by all whitespace.
        if (len(line)>=4 and line[3]=="A"):
            ...
This splits each line on spaces, so you can then use list indexing: line[3] is the fourth field.
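To make the skeleton above concrete, here is a minimal sketch that keeps the rows whose fourth whitespace-separated field is "A" and writes them out with the standard csv module. The function names keep_a_rows and write_csv are hypothetical, the sample lines are copied from the question, and an in-memory buffer stands in for out.csv so the sketch runs without any files.

```python
import csv
import io

def keep_a_rows(lines, field_index=3, wanted="A"):
    """Return the split fields of every line whose field at field_index equals wanted."""
    rows = []
    for line in lines:
        fields = line.split()  # split on any run of whitespace
        # Short lines such as "random" have fewer than four fields; skip them.
        if len(fields) > field_index and fields[field_index] == wanted:
            rows.append(fields)
    return rows

def write_csv(rows, out):
    """Write rows to an already-open text file object as comma-separated values."""
    csv.writer(out, lineterminator="\n").writerows(rows)

# Sample lines copied from the question.
sample = [
    "random",
    "----new data-----",
    "06/19/2018 13:57:39.99 random information here",
    "06/19/2018 13:58:35.08 00:00:04.38 A 00000 0 765 228270257 A0 44 45",
    "06/19/2018 13:58:39.99 00:00:00.00 A 00000 0 756 228270257 A0 4 5",
    "---end data---",
]

rows = keep_a_rows(sample)
buffer = io.StringIO()  # stands in for open('out.csv', 'w', newline='')
write_csv(rows, buffer)
print(buffer.getvalue())
```

With a real file, the same functions could be called as write_csv(keep_a_rows(fi), out), where fi is the open play.txt and out is out.csv opened with newline=''.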
Why ln.startswith("A", 4) doesn't work
That code doesn't work because startswith compares characters, not whitespace-separated fields. Python reads the line in as a string of characters, which consists of the text that you have. ln.startswith("A", 4) tests the character at index 4, which is the fifth character because indexing starts at 0. Even ln.startswith("A", 3) only gets the literal 4th character in the string, which, in all of the data lines, is the character "1".
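import pandas as pd

# read the file
with open('play.txt') as f:
    file = f.read()
id = []
# loop through the file and if the fourth word is 'A' then append that line to 'id'
# (lines with fewer than four words are skipped to avoid an IndexError)
for line in file.splitlines():
    words = line.split()
    if len(words) >= 4 and words[3] == 'A':
        id.append(words)
# save to a dataframe
df = pd.DataFrame(id)
df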
            0            1            2  3      4  5    6          7   8   9  10
0  06/19/2018  13:58:35.08  00:00:04.38  A  00000  0  765  228270257  A0  44  45
1  06/19/2018  13:58:39.99  00:00:00.00  A  00000  0  756  228270257  A0   4   5
2  06/19/2018  13:58:40.61  00:00:00.00  A  00000  0  828  228270257  A0   1   7
# if you want specify column names too
# df = pd.DataFrame(id, columns=['col_name_1', 'col_name_2'... ])
# save to csv
df.to_csv('out.csv')