
Python Popen.communicate( ). TypeError: Expected String or Buffer, not List

CONTEXT

The code is supposed to get a file object and extract information from it using awk.

It uses readlines() with 'pieceSize' as an argument. 'pieceSize' is the number of bytes (16 MB here) I want readlines() to work with at a time as it goes through the file. I did this in the hope that my program won't run into trouble if the file that needs to be read is much larger than my computer's memory. The file being read has many rows and columns.

The code below is trying to read the first field from the first line using awk.

import os
from subprocess import Popen, PIPE, STDOUT

def extract_info(file_object):
    pieceSize = 16777216 # 16MB
    for line in file_object.readlines(pieceSize):
        eachline = line.rsplit() # removing extra returns
        p = Popen(['awk','{{print $1}}'], stdout=PIPE, stdin=PIPE, stderr=STDOUT)
        pOut = p.communicate(input=eachline)[0]
        print(pOut.decode())

THE ERROR MESSAGE

The error I receive reads something like ...

... in _communicate_with_poll(self, input)
           chunk = input[input_offset : input_offset + _PIPE_BUF]
           try:
->             input_offset += os.write(fd, chunk)
           except OSError as e:
               if e.errno == errno.EPIPE:
TypeError: must be string or buffer, not list

The error occurs because str.rsplit() returns a list, but Popen.communicate() expects a string (or buffer). So you can't pass eachline to communicate() directly.
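To make the mismatch concrete, here is a minimal sketch. The sample line is made up, and the .encode() step assumes Python 3, where communicate() wants bytes unless the pipes are opened in text mode:

```python
line = "alpha beta gamma\n"   # hypothetical sample line

fields = line.rsplit()        # splits on any whitespace and returns a list
print(type(fields).__name__, fields)

# communicate(input=...) needs one string (bytes on Python 3),
# so the list has to be joined back together before it is passed on.
payload = " ".join(fields).encode()
print(type(payload).__name__, payload)
```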

That's the cause of the problem, but I'm not sure why you are splitting the lines. rsplit() with no arguments splits on any run of whitespace, including spaces and tabs, not just the trailing newline. Is that really what you want?

Also, this code only iterates over the first batch of lines returned by readlines(). The rest of the file remains unprocessed. You need an outer loop to keep going until the input file is exhausted (perhaps there is one in the calling code that you don't show?). On top of that, it calls Popen once for every line of input, which is going to be very inefficient.
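If you do want to keep readlines() with a size hint, the outer loop could look roughly like this. It is only a sketch: iter_line_batches and piece_size are names I made up, and the in-memory file stands in for your real input.

```python
import io

def iter_line_batches(file_object, piece_size=16 * 1024 * 1024):
    """Yield lists of lines until the file is exhausted.

    readlines(hint) stops reading once the lines collected so far
    exceed roughly `hint` in size, and returns an empty list at EOF.
    """
    while True:
        batch = file_object.readlines(piece_size)
        if not batch:          # empty list: nothing left to read
            break
        yield batch

# Usage with an in-memory file and a tiny hint to force several batches.
sample = io.StringIO("a 1\nb 2\nc 3\n")
batches = list(iter_line_batches(sample, piece_size=1))
print(batches)
```

Each batch can then be processed as a whole, which also avoids starting one child process per line.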

I suggest that you handle the processing entirely in Python. line.split()[0] gives you the data that you need (the first column of the file) without passing it to awk, and iterating over the file line by line is memory efficient.

Perhaps a generator is a better solution:

def extract_info(file_object):
    for line in file_object:
        yield line.split()[0]

Then you can iterate over it in the calling code:

with open('inputfile') as f:
    for first_field in extract_info(f):
        print(first_field)
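One caveat: line.split()[0] raises IndexError on a blank line, because split() returns an empty list there. If your file may contain blank lines, a variant along these lines skips them (extract_first_fields is just an illustrative name, and the sample data is invented):

```python
import io

def extract_first_fields(file_object):
    """Yield the first whitespace-separated field of each non-blank line."""
    for line in file_object:
        fields = line.split(None, 1)  # split at most once; only field 1 is needed
        if fields:                    # a blank line gives an empty list
            yield fields[0]

sample = io.StringIO("alpha 1\n\n  beta 2\n")
print(list(extract_first_fields(sample)))
```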

You need to pass a string to input, for example one element of the list returned from split:

 pOut, _ = p.communicate(input=eachline[0])

You are passing line.rsplit(), i.e. a list. I'm not sure what exactly you want to pass; maybe you want input=" ".join(eachline). Whatever it is, it should be a string, not the list itself. Also, your awk syntax seems to be incorrect.

You can also iterate over the file object itself to go line by line, avoiding readlines() altogether.

for line in file_object:  

So the whole code would be something like:

def extract_info(file_object):
    for line in file_object:
        eachline = line.rsplit() # removing extra returns
        p = Popen(['awk','{print $1}'], stdout=PIPE, stdin=PIPE, stderr=STDOUT)
        pOut,_ = p.communicate(input=" ".join(eachline))
        print(pOut.decode())

Obviously, fix the eachline logic to do whatever it is you expect it to do.
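If awk really has to stay, starting it once for a whole batch of lines is much cheaper than once per line. The following is a sketch under a few assumptions: Python 3.5+ (for subprocess.run), awk available on the PATH, and first_fields_via_awk being a name I invented for illustration.

```python
import shutil
import subprocess

def first_fields_via_awk(lines):
    """Feed all lines to a single awk process and return its output lines."""
    completed = subprocess.run(
        ["awk", "{print $1}"],
        input="".join(lines),        # one string for the whole batch
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,     # text mode: str in, str out
    )
    return completed.stdout.splitlines()

if shutil.which("awk"):              # skip the demo when awk is not installed
    print(first_fields_via_awk(["alpha 1\n", "beta 2\n"]))
```

Note that the whole batch is held in memory here, so for a very large file it should be fed in pieces rather than all at once.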

On another note, there is no need to use awk at all; you can do all of this in Python.

def extract_info(file_object):
    for line in file_object:
        eachline = line.split(None, 1)
        print(eachline[0])

Or even more succinctly, with map and extended iterable unpacking in Python 3:

def extract_info(file_object):
    for i, *_ in map(str.split, file_object):
        print(i)

It's not fully clear what output you are expecting to achieve.

However, maybe this will be helpful:

  • Why use awk if all you are doing is printing the first word of each line? You can use Python for that.
  • If you want to read a file that is larger than your memory, load one line at a time using readline() or for line in file_handler. Avoid calling readlines() or read() without a size argument, since they load the entire file.

Try this:

with open('myfile.txt') as f:
    for line in f:
        first_word = line.split()[0]
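The readline() alternative mentioned in the list above works the same way, one line at a time. A rough sketch (first_words is a made-up name, and the in-memory file is only there so the example runs on its own):

```python
import io

def first_words(file_handler):
    """Yield the first word of each non-blank line, reading with readline()."""
    while True:
        line = file_handler.readline()
        if not line:          # readline() returns '' only at end of file
            break
        words = line.split()
        if words:             # skip blank lines
            yield words[0]

sample = io.StringIO("one two\nthree four\n")
print(list(first_words(sample)))
```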
