简体   繁体   中英

Find and Edit Text File

I'm looking to find if there is a way of automating this process. Basically I have 300,000 rows of data needed to download on a daily basis. There are a couple of rows that need to be edited before it can be uploaded to SQL.

Jordan || Michael | 23 | Bulls | Chicago

Bryant | Kobe ||| 8 || LA

What I want to accomplish is to just have 4 vertical bars per row. Normally, I would search for a keyword then edit it manually then save. These two are the only anomalies in my data.

  1. Find "Jordan", then remove the excess 1 vertical bar "|" right after it.
  2. I need to find "Kobe", then remove the two excess vertical bars "|" right after it.

Correct format is below -

Jordan | Michael | 23 | Bulls | Chicago

Bryant | Kobe | 8 || LA

Not sure if this can be done in vbscript or Python. Any help would be appreciated. Thanks!

Python or vbscript could be used but they are overkill for something this simple. Try sed :

$ sed -E 's/(Jordan *)\|/\1/g; s/(Kobe *)\| *\|/\1/g' file 
Jordan | Michael | 23 | Bulls | Chicago
Bryant | Kobe | 8 || LA

To save to a new file:

sed -E 's/(Jordan *)\|/\1/g; s/(Kobe *)\| *\|/\1/g' file >newfile

Or, to change the existing file in-place:

sed -Ei.bak 's/(Jordan *)\|/\1/g; s/(Kobe *)\| *\|/\1/g' file 

How it works

sed reads and processes a file line by line. In our case, we need only the substitute command which has the form s/old/new/g where old is a regular expression and, if it is found, it is replaced by new . The optional g at the end of the command tells sed to perform the substitution command 'globally', meaning not just once but as many times as it appears on the line.

  • s/(Jordan *)\\|/\\1/g

    This tells sed to look for Jordan followed by zero or more spaces followed by a vertical bar and remove the vertical bar.

    In more detail, the parens in (Jordan *) tell sed to save the string Jordan followed by zero or more spaces as a group. In the replacement side, we reference that group as \\1 .

  • s/(Kobe *)\\| *\\|/\\1/g

    Similarly, this tells sed to look for Kobe followed by zero or more spaces followed by a vertical bar and remove the vertical bar.

Using python

Using the same logic as above, here is a python program:

$ cat kobe.py
import re
with open('file') as f:
    for line in f:
        line = re.sub(r'(Jordan *)\|', r'\1', line)
        line = re.sub(r'(Kobe *)\| *\|', r'\1', line)
        print(line.rstrip('\n'))
$ python kobe.py
Jordan | Michael | 23 | Bulls | Chicago
Bryant | Kobe | 8 || LA

To save that to a new file:

python kobe.py >newfile

I wrote a code snippet in Python 3.5 as follows.

# -*- coding: utf-8 -*-

rows = ["Jordan||Michael|23|Bulls|Chicago",
        "Bryant|Kobe|||8||LA"]

keywords = ["Jordan", "Kobe"]        

def get_keyword(row, keywords):
    for word in keywords:
        if word in row:
            return word
    else:
        return None            

for line in rows:
    num_bars = line.count('|')
    num_bars_del = num_bars - 4  # Number of bars to be deleted
    kw = get_keyword(line, keywords)
    if kw:  # this line contains a keyword
        # Split the line by the keyword
        first, second = line.split(kw)
        second = second.lstrip()
        result = "%s%s%s"%(first, kw, second[num_bars_del:])
        print(result)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM