Keyword search in just one column of a file, keeping 2 words before and after the keyword

I love Python, and I am new to it. With the help of the community (users like Antti Haapala) I was able to make some progress, but I got stuck at the end. Please help. I have a few tasks remaining before I start my big data POC (I plan to run this code on a text file with 1+ million records).

• Search for a keyword in column 3 (C#3) and keep the 2 words before and after that keyword.

• Divert the print output to a file.

• I don't want to touch C#1 and C#2, for referential integrity purposes.

I really appreciate all your help.

My input file:

C#1 | C#2 | C#3   (these are column headings, used just for clarity)
12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it 

Desired output file: (only change in Column 3 or last column)

12088|CITA|very nice lists, better to 
12089|CITA|theme for lists keep it

Code I am currently using:

s = """12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it """
for line in s.splitlines():  
    if not line.strip():
        continue  
    fields = line.split(None, 2)  
    joined = '|'.join(fields)
    print(joined)

By the way, if I use a keyword search, it also looks at my 1st and 2nd columns. My challenge is to keep the 1st and 2nd columns unchanged, search only the 3rd column, and keep the 2 words before and after the keyword(s).

First, I need to warn you that using this code on 1 million records is risky. You are relying on regular expressions, and this method works only as long as the text follows a regular pattern. Otherwise you may end up writing a large number of special cases to extract the data you want without also extracting data you don't want.

For 1 million records I would use pandas, as a plain for loop is likely to be too slow.

import re

import pandas as pd

df = pd.DataFrame({
    "C1": [12088, 12089],
    "C2": ["CITA", "CITA"],
    "C3": ["Hello very nice lists, better to keep those",
           "This is great theme for lists keep it"],
})

# Raw string, so that \w and \s reach the regex engine unchanged.
pattern = re.compile(r"(?<=Hello)[\w\s,]*(?=keep)|(?<=great)[\w\s,]*")

# Keep the first match in each row, stripped of surrounding whitespace.
# Note: x[0] raises IndexError for a row where the pattern does not match.
df["C3"] = df["C3"].map(lambda x: pattern.findall(str(x))[0].strip())

which gives

df
      C1    C2                          C3
0  12088  CITA  very nice lists, better to
1  12089  CITA     theme for lists keep it
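
The pattern above is tied to the literal words Hello, great, and keep, so it would need a new alternative for every new sentence shape. A more general sketch, using only the standard `re` module, builds the pattern from the keyword itself. The names `build_pattern` and `trim_line` are hypothetical, and the sketch assumes that a "word" is any run of non-whitespace characters, that the match is case-sensitive, and that the keyword may carry punctuation directly before or after it (as in "lists,").

```python
import re


def build_pattern(keyword, n=2):
    """Compile a pattern matching `keyword` plus up to `n` words per side."""
    return re.compile(
        # (?<!\S) makes the match start at the beginning of a word.
        # [^\w\s]* allows punctuation glued to the keyword, e.g. "lists,".
        # (?!\S) makes the keyword token end at whitespace or end of text.
        r"(?<!\S)(?:\S+\s+){0,%d}[^\w\s]*%s[^\w\s]*(?!\S)(?:\s+\S+){0,%d}"
        % (n, re.escape(keyword), n)
    )


def trim_line(line, pattern):
    """Apply `pattern` to column 3 only; columns 1 and 2 are not searched."""
    cols = line.split("|", 2)
    if len(cols) < 3:
        return line  # not a three-column record, leave it alone
    match = pattern.search(cols[2])
    if match:
        cols[2] = match.group(0)
    return "|".join(cols)


pattern = build_pattern("lists")
for line in ["12088|CITA|{Hello very nice lists, better to keep those",
             "12089|CITA|This is great theme for lists keep it "]:
    print(trim_line(line, pattern))
```

Lines whose third column does not contain the keyword are returned unchanged here; whether they should instead be dropped is a requirement the question leaves open.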

There are still some open questions about how exactly you want to perform your keyword search. One obstacle is already contained in your example: how should characters such as commas be handled? It is also not clear what to do with lines that do not contain the keyword, or what to do if there are fewer than two words before or after the keyword. I guess that you yourself are a little unsure about the exact requirements and have not yet thought about all the edge cases.

Nevertheless, I have made some "blind decisions" about these questions, and here is a naive example implementation that assumes your keyword matching rules are rather simple. I have created the function findword(), which you can adjust to whatever you like. Maybe this example helps you work out your own requirements.

KEYWORD = "lists"

S = """12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it """


def findword(words, keyword):
    """Return index of first occurrence of `keyword` in sequence
    `words`, otherwise return None.

    The current implementation searches for "keyword" as well as
    for "keyword," (with trailing comma).
    """
    for test in (keyword, "%s," % keyword):
        try:
            return words.index(test)
        except ValueError:
            pass
    return None


for line in S.splitlines():
    # Split on the first two pipes only, so column 3 stays in one piece.
    tokens = line.split("|", 2)
    words = tokens[2].split()
    idx = findword(words, KEYWORD)
    if idx is None:
        # Keyword not found. Print line without change.
        print(line)
        continue
    # Slicing stops at the end of the list by itself, so only the start
    # index has to be kept from going negative.
    start = max(idx - 2, 0)
    end = idx + 3
    tokens[2] = " ".join(words[start:end])
    print("|".join(tokens))

Test:

$ python test.py
12088|CITA|very nice lists, better to
12089|CITA|theme for lists keep it
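
The edge cases raised earlier (fewer than two words before or after the keyword) can be checked in isolation. Below is a minimal sketch, where `context_window` is a hypothetical helper that is not part of the code above. It relies on two facts about Python slicing: a negative start index counts from the end of the list, so it has to be clamped at 0, while an end index past the list length is simply cut off at the end.

```python
def context_window(words, idx, n=2):
    """Return the word at `idx` plus up to `n` words on each side."""
    # A negative start would count from the end of the list, so clamp it.
    start = max(idx - n, 0)
    # An end index beyond len(words) is harmless: the slice just stops.
    return words[start:idx + n + 1]


words = "This is great theme for lists keep it".split()

print(context_window(words, words.index("lists")))  # keyword in the middle
print(context_window(words, words.index("This")))   # keyword is the first word
print(context_window(words, words.index("keep")))   # only one word after it
print(context_window(words, words.index("it")))     # keyword is the last word
```

In each case the keyword itself stays in the result, and the window shrinks on whichever side runs out of words.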

PS: I hope I got the indices right for slicing. You should check, nevertheless.
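
The question also asks for the output to go to a file instead of being printed. One possible sketch reads the input one line at a time, so a file with a million records never has to fit in memory at once. The function name `filter_file`, the file names in the usage comment, and the UTF-8 encoding are assumptions made for illustration. The keyword test here strips punctuation from each word before comparing, which is one of several reasonable matching rules, and blank lines are skipped as in the question's own code.

```python
import string


def filter_file(src_path, dst_path, keyword, n=2):
    """Copy src_path to dst_path, trimming column 3 around `keyword`."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # iterates lazily, one line at a time
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            tokens = line.split("|", 2)  # columns 1 and 2 stay untouched
            if len(tokens) == 3:
                words = tokens[2].split()
                # Compare with punctuation removed, so "lists," matches "lists".
                hits = [i for i, w in enumerate(words)
                        if w.strip(string.punctuation) == keyword]
                if hits:
                    i = hits[0]  # first occurrence only
                    tokens[2] = " ".join(words[max(i - n, 0):i + n + 1])
            dst.write("|".join(tokens) + "\n")


# Hypothetical usage; the file names are placeholders:
# filter_file("input.txt", "output.txt", "lists")
```

Lines without the keyword are written out unchanged, which mirrors the behaviour of the loop above.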
