CONTEXT
The code is supposed to get a file object and extract information from it using awk.
It uses readlines() with 'pieceSize' as an argument. 'pieceSize' is the amount of data (16 MB, given in bytes) I want readlines() to work with at a time as it goes through the file. I did this in the hope that my program won't run into trouble if the file that needs to be read is much larger than my computer's memory. The file being read has many rows and columns.
The code below is trying to read the first field from the first line using awk.
import os
from subprocess import Popen, PIPE, STDOUT

def extract_info(file_object):
    pieceSize = 16777216 # 16MB
    for line in file_object.readlines(pieceSize):
        eachline = line.rsplit() # removing extra returns
        p = Popen(['awk','{{print $1}}'], stdout=PIPE, stdin=PIPE, stderr=STDOUT)
        pOut = p.communicate(input=eachline)[0]
        print(pOut.decode())
THE ERROR MESSAGE
The error I receive reads something like ...
... in _communicate_with_poll(self, input)
chunk = input[input_offset : input_offset + _PIPE_BUF]
try:
-> input_offset += os.write(fd, chunk)
except OSError as e:
if e.errno == errno.EPIPE:
TypeError: must be string or buffer, not list
The error occurs because str.rsplit() returns a list, but Popen.communicate() expects a string (or buffer). So you can't pass eachline to communicate().
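To see the mismatch concretely, here is a minimal sketch (the sample line is made up for illustration) comparing what rsplit() returns with what rstrip() returns. If the goal was only to remove the trailing newline, as the "removing extra returns" comment suggests, rstrip() may be what was intended:

```python
# Hypothetical sample line; real input would come from the file.
line = "alpha beta gamma\n"

fields = line.rsplit()        # splits on any whitespace: the result is a list
stripped = line.rstrip("\n")  # only drops the trailing newline: still a string

print(type(fields).__name__, fields)
print(type(stripped).__name__, repr(stripped))
```

Only the second value is a string, so only it (or a single element of the list) is something communicate() could accept.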
That's the cause of the problem, but I'm not sure why you are splitting the lines. rsplit() will split on all whitespace, including spaces, tabs etc. Is that really what you want?
Also, this code will iterate only over the first batch of lines returned by readlines(). The rest of the file remains unprocessed. You need an outer loop to keep things going until the input file is exhausted (possibly there is one in the calling code that you don't show?). On top of that, it calls Popen once for every line of input, which is going to be very inefficient.
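The outer loop mentioned above could look roughly like the following. This is a sketch, not the asker's code: the helper name iter_lines_in_chunks is made up, and the size hint passed to readlines() is only approximate (readlines() stops after the line that pushes the total past the hint):

```python
import io

def iter_lines_in_chunks(file_object, piece_size=16 * 1024 * 1024):
    """Yield lines, fetching roughly piece_size worth of data per readlines() call."""
    while True:
        lines = file_object.readlines(piece_size)
        if not lines:          # an empty list means the file is exhausted
            break
        for line in lines:
            yield line

# Small in-memory demonstration with a deliberately tiny hint,
# so that several readlines() calls are needed.
sample = io.StringIO("a 1\nb 2\nc 3\n")
print([line.split()[0] for line in iter_lines_in_chunks(sample, piece_size=2)])
```

In practice, iterating over the file object directly gives the same bounded memory use with less code, because Python already reads the file in buffered blocks.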
I suggest that you handle the processing entirely in Python. line.split()[0] effectively gives you the data that you need (the first column of the file) without passing it to awk. Iterating line by line is memory efficient.
Perhaps a generator is a better solution:
def extract_info(file_object):
    for line in file_object:
        yield line.split()[0]
Then you can iterate over it in the calling code:
with open('inputfile') as f:
    for first_field in extract_info(f):
        print(first_field)
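One caveat with the generator above: line.split()[0] raises IndexError on a blank line, because split() returns an empty list there. A possible variant that skips such lines is sketched below; the name extract_first_fields and the sample data are made up for illustration:

```python
import io

def extract_first_fields(file_object):
    """Yield the first whitespace-separated field of each non-blank line."""
    for line in file_object:
        fields = line.split(None, 1)  # split at most once; only field 0 is needed
        if fields:                    # blank lines produce an empty list
            yield fields[0]

# In-memory stand-in for a real file, including a blank line.
sample = io.StringIO("alpha 1 2\n\n  beta 3 4\n")
print(list(extract_first_fields(sample)))
```

Whether blank lines should be skipped or reported depends on the data, so treat the skip as an assumption.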
You need to pass a string from the list returned by split as input:
pOut, _ = p.communicate(input=eachline[0])
You are passing line.rsplit(), i.e. a list. I'm not sure what you want to pass exactly; maybe you want input=" ".join(eachline). Whatever it is, it should be a string, not the list itself. Also, your awk syntax seems to be incorrect.
You can also iterate over the file object itself to go line by line, avoiding readlines() altogether:
for line in file_object:
So the whole code would be something like:
def extract_info(file_object):
    for line in file_object:
        eachline = line.rsplit() # removing extra returns
        p = Popen(['awk','{print $1}'], stdout=PIPE, stdin=PIPE, stderr=STDOUT)
        # Note: on Python 3, communicate() needs bytes here, e.g. " ".join(eachline).encode()
        pOut,_ = p.communicate(input=" ".join(eachline))
        print(pOut.decode())
Obviously, you should fix the eachline logic so that it does whatever you expect it to do.
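If awk is preferred after all, a single awk process can consume the whole file instead of one process being started per line. The following is a minimal sketch, assuming Python 3 and an awk binary on PATH; the function name first_fields_with_awk and the sample file contents are made up for illustration:

```python
import os
import shutil
import subprocess
import tempfile

def first_fields_with_awk(path):
    """Start awk once, feed it the file on stdin, and yield its output lines."""
    with open(path, "rb") as f:
        # awk reads directly from the open file; Python only streams awk's output.
        with subprocess.Popen(["awk", "{print $1}"], stdin=f,
                              stdout=subprocess.PIPE) as proc:
            for raw in proc.stdout:
                yield raw.decode().rstrip("\n")

if shutil.which("awk"):  # only run the demonstration when awk is available
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("alpha 1 2\nbeta 3 4\n")
    print(list(first_fields_with_awk(tmp.name)))
    os.remove(tmp.name)
```

Passing the open file as stdin lets awk do its own reading, so the input never has to fit in Python's memory.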
On another note, there is no need to use awk at all; you can do all of this with Python.
def extract_info(file_object):
    for line in file_object:
        eachline = line.split(None, 1)
        print(eachline[0])
Or even more succinctly, with map and extended iterable unpacking in Python 3:
def extract_info(file_object):
    for i, *_ in map(str.split, file_object):
        print(i)
It's not fully clear what output you are expecting to achieve.
However, maybe this will be helpful:
You don't need awk if all you are doing is printing the first word in a line; you can use Python for that. Use readline() or for line in file_handler, and avoid readlines() and read(), which load the entire file. Try this:
with open('myfile.txt') as f:
    for line in f:
        first_word = line.split()[0]