
How to achieve faster file I/O in Python?

I have a speed/efficiency-related question about Python:

I need to extract multiple fields from a nested JSON file, where each line can contain floats and strings. After writing to the .txt files they have ~64k lines each, and the current snippet does it in ~9 minutes.

Normally, I would just put all my data in numpy and use np.savetxt() to save it.
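For purely numeric data that would be a single call; a minimal sketch with hypothetical data, just to show what I mean:

import numpy as np

data = np.random.rand(64000, 8)        # hypothetical numeric-only data
np.savetxt('train_numeric.txt', data)  # one call writes the whole array

But since my fields mix floats and strings, that doesn't apply here.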

I have resorted to simply assembling the lines as strings, but this is rather slow. So far I'm doing:

  • Assemble each line as a string (extracting the desired fields from the JSON)
  • Write the string to the corresponding file

I have several problems with this:

  • it leads to many separate file.write() calls (around 64k * 8 calls, one per line for each of the 8 files), which seem to be very slow as well

So my questions are:

  • What is a good routine for this kind of problem, one that balances speed against memory consumption for the most efficient writing to disk?
  • Should I increase my DEFAULT_BUFFER_SIZE? (It's currently 8192.)

I have checked File I/O in Every Programming Language and the python.org docs on IO, but they didn't help much, except that (in my understanding, after going through them) file I/O should already be buffered in Python 3.6.x, and I found that my DEFAULT_BUFFER_SIZE is 8192.
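For reference, the default can be inspected via io.DEFAULT_BUFFER_SIZE, and a larger buffer can be requested per file through the buffering argument of open(). A minimal sketch (the filename is just a placeholder):

import io

print(io.DEFAULT_BUFFER_SIZE)  # 8192 on a typical CPython build

# For text-mode files, an integer > 1 sets the size in bytes of the
# underlying binary buffer, so a 1 MiB buffer means fewer OS-level writes.
with open('example.txt', 'w', encoding='utf-8', buffering=1 << 20) as f:
    f.write('one line\n')

As I understand it, though, raising the buffer size only reduces system calls; the per-call Python overhead of each file.write() stays the same.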

Here's the relevant part of my snippet:

import json
import os
import re

from tqdm import tqdm_notebook  # progress bar in the notebook


def read_json_line(line=None):
    result = None
    try:
        result = json.loads(line)
    except Exception as e:
        # Find the offending character index from the error message:
        idx_to_replace = int(str(e).split(' ')[-1].replace(')', ''))
        # Replace the offending character with a space and retry:
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        new_line = ''.join(new_line)
        return read_json_line(line=new_line)
    return result

def extract_features_and_write(path_to_data, inp_filename, is_train=True):
    # It currently has 8 file.write() calls per input line, which is probably
    # making it slow, as writing to disk involves a lot of overhead as well
    features = ['meta_tags__twitter-data1', 'url', 'meta_tags__article-author', 'domain', 'title', 'published__$date',
                'content', 'meta_tags__twitter-description']

    prefix = 'train' if is_train else 'test'

    feature_files = [open(os.path.join(path_to_data, '{}_{}.txt'.format(prefix, feat)), 'w', encoding='utf-8')
                     for feat in features]

    # PATH_TO_RAW_DATA is a module-level constant defined elsewhere
    with open(os.path.join(PATH_TO_RAW_DATA, inp_filename),
              encoding='utf-8') as inp_json_file:

        for line in tqdm_notebook(inp_json_file):
            json_data = read_json_line(line)

            content = json_data['meta_tags']['twitter:data1'].replace('\n', ' ').replace('\r', ' ').split()[0]
            feature_files[0].write(content + '\n')

            content = json_data['url'].split('/')[-1].lower()
            feature_files[1].write(content + '\n')

            content = json_data['meta_tags']['article:author'].split('/')[-1].replace('@', '').lower()
            feature_files[2].write(content + '\n')

            content = json_data['domain']
            feature_files[3].write(content + '\n')

            content = json_data['title'].replace('\n', ' ').replace('\r', ' ').lower()
            feature_files[4].write(content + '\n')

            content = json_data['published']['$date']
            feature_files[5].write(content + '\n')

            content = json_data['content'].replace('\n', ' ').replace('\r', ' ')
            content = strip_tags(content).lower()  # strip_tags(): HTML-stripping helper defined elsewhere
            content = re.sub(r"[^a-zA-Z0-9]", " ", content)
            feature_files[6].write(content + '\n')

            content = json_data['meta_tags']['twitter:description'].replace('\n', ' ').replace('\r', ' ').lower()
            feature_files[7].write(content + '\n')

    for f in feature_files:  # the output files were opened manually, so close them
        f.close()
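For comparison, the direction I'm considering is to batch all the lines in memory and write each file once with writelines(). A rough sketch reusing read_json_line() from above; extract() is a hypothetical stand-in for the per-field logic:

def extract_features_batched(path_to_data, inp_filename, features, extract, prefix='train'):
    # Collect every output line per feature in memory first; ~64k short
    # lines per feature should fit comfortably in RAM for this dataset.
    buffers = {feat: [] for feat in features}

    with open(os.path.join(path_to_data, inp_filename), encoding='utf-8') as inp_json_file:
        for line in inp_json_file:
            json_data = read_json_line(line)
            for feat in features:
                buffers[feat].append(extract(json_data, feat) + '\n')

    # One bulk writelines() per file instead of one write() per input line:
    for feat in features:
        out_path = os.path.join(path_to_data, '{}_{}.txt'.format(prefix, feat))
        with open(out_path, 'w', encoding='utf-8') as out_file:
            out_file.writelines(buffers[feat])

But I'm not sure whether trading that much memory for fewer calls is the right balance.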

From a comment:

Why do you think that 8 writes result in 8 physical writes to your hard disk? The file object itself buffers what to write; if it decides to write to your OS, your OS might also wait a little until it physically writes, and even then your hard drive has buffers that might keep the file's contents for a while until it really starts to write. See How often does Python flush to a file?
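To make those buffering layers concrete, a minimal sketch with a placeholder filename; forcing the sync is almost never necessary and will actually slow you down:

import os

with open('demo.txt', 'w', encoding='utf-8') as f:
    f.write('hello\n')    # lands in Python's user-space buffer
    f.flush()             # pushes that buffer to the OS page cache
    os.fsync(f.fileno())  # asks the OS to commit the data to the physical disk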


You should not use exceptions as control flow, nor recurse where it is not needed. Each recursion prepares a new call stack for the function call; that takes resources and time, and all of it has to be unwound as well.

The best thing to do would be to clean up your data before feeding it into json.loads() ... the next best thing would be to avoid recursing ... try something along the lines of:

def read_json_line(line=None):
    result = None

    while result is None and line:  # an empty line is falsy, avoids an endless loop
        try:
            result = json.loads(line)
        except Exception as e:
            result = None
            # Find the offending character index from the error message:
            idx_to_replace = int(str(e).split(' ')[-1].replace(')', ''))
            # Slice away the offending character:
            line = line[:idx_to_replace] + line[idx_to_replace + 1:]

    return result
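Usage is the same as with the recursive version. A quick hypothetical example; note that extracting the index this way relies on CPython's JSON error messages ending in "(char N)":

import json  # read_json_line() calls json.loads internally

broken = '{"title": "hello"\x00}'  # one stray control character at index 17
print(read_json_line(broken))      # {'title': 'hello'} after one repair pass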
