
Keyword search in just one column of a file, keeping 2 words before and after the keyword

I love Python, though I am new to it. With the help of the community (users like Antti Haapala) I was able to make some progress, but I got stuck at the end. Please help. I have two tasks remaining before I start my big data POC (I plan to run this code on a text file with over 1 million records):

• Search for a keyword in column 3 (C#3) and keep the 2 words before and after that keyword.

• Redirect the print output to a file.

• I don't want to touch C#1 and C#2, for referential integrity purposes.

I really appreciate all your help.

My input file:

C#1   C#2   C#3   (column headings, shown only for clarity)
12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it 

Desired output file (only column 3, the last column, changes):

12088|CITA|very nice lists, better to 
12089|CITA|theme for lists keep it

Code I am currently using:

s = """12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it """
for line in s.splitlines():  
    if not line.strip():
        continue  
    fields = line.split(None, 2)  
    joined = '|'.join(fields)
    print(joined)

BTW, if I use a keyword search, it also looks at my 1st and 2nd columns. My challenge is to keep the 1st and 2nd columns unchanged, search only the 3rd column, and keep the 2 words before and after the keyword(s).

First, I need to warn you that using this code on 1 million records is dangerous. You are dealing with regular expressions, and this method works only as long as the data itself is regular. Otherwise you may end up writing tons of cases to extract the data you want without also extracting the data you don't want.

For 1 million records I would use pandas, as a plain for loop is too slow.

import re

import pandas as pd

df = pd.DataFrame({
    "C1": [12088, 12089],
    "C2": ["CITA", "CITA"],
    "C3": ["Hello very nice lists, better to keep those",
           "This is great theme for lists keep it"],
})
# Capture the text between "Hello" and "keep", or everything after "great".
pattern = r"(?<=Hello)[\w\s,]*(?=keep)|(?<=great)[\w\s,]*"
df["C3"] = df["C3"].map(lambda x: re.findall(pattern, str(x)))
# Keep the first match and trim surrounding whitespace
# (raises IndexError for rows where nothing matched).
df["C3"] = df["C3"].map(lambda x: x[0].strip())

which gives

df
      C1    C2                          C3
0  12088  CITA  very nice lists, better to
1  12089  CITA     theme for lists keep it
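The pattern above is tied to the literal words "Hello", "keep" and "great", so it only fits these two sample rows. As a rough sketch, assuming a "word" is any run of non-whitespace characters and that the keyword begins and ends with a letter or digit, the expression can instead be built around the keyword itself. The helper name keyword_window and its width parameter are made up for illustration and are not part of the original answer.

```python
import re


def keyword_window(text, keyword, width=2):
    """Return up to `width` words on each side of the first `keyword` hit.

    Hypothetical helper: words are whitespace-separated, and punctuation
    attached to the keyword (such as "lists,") stays in the result.
    If the keyword does not occur, the text is returned unchanged.
    """
    pattern = re.compile(
        r"(?:\S+\s+){0,%d}\S*\b%s\b\S*(?:\s+\S+){0,%d}"
        % (width, re.escape(keyword), width)
    )
    match = pattern.search(text)
    return match.group(0) if match else text
```

With the DataFrame from the answer, this could be applied as df["C3"].map(lambda x: keyword_window(str(x), "lists")). Rows without the keyword pass through untouched instead of raising an error.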

Some questions remain about exactly how you want the keyword search to work. One obstacle is already present in your example: how should characters such as commas be handled? It is also unclear what to do with lines that do not contain the keyword, or with lines that have fewer than two words before or after it. I guess you are still a little unsure about the exact requirements yourself and have not thought through all the edge cases.

Nevertheless, I have made some "blind decisions" on these questions. Here is a naive example implementation that assumes your keyword matching rules are rather simple. I have created the function findword(), which you can adjust to whatever you like. Maybe this example helps you work out your own requirements.

KEYWORD = "lists"

S = """12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it """


def findword(words, keyword):
    """Return index of first occurrence of `keyword` in sequence
    `words`, otherwise return None.

    The current implementation searches for "keyword" as well as
    for "keyword," (with trailing comma).
    """
    for test in (keyword, "%s," % keyword):
        try:
            return words.index(test)
        except ValueError:
            pass
    return None


for line in S.splitlines():
    tokens = line.split("|")
    words = tokens[2].split()
    idx = findword(words, KEYWORD)
    if idx is None:
        # Keyword not found. Print line without change.
        print(line)
        continue
    # Keep up to two words on each side. Slicing clamps past the end of
    # the list, so only the start needs an explicit lower bound.
    start = max(idx - 2, 0)
    end = idx + 3
    tokens[2] = " ".join(words[start:end])
    print("|".join(tokens))

Test:

$ python test.py
12088|CITA|very nice lists, better to
12089|CITA|theme for lists keep it

PS: I hope I got the slicing indices right. You should check them nevertheless.
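The question also asks for the output to go to a file rather than to the console. One possible sketch, assuming a UTF-8 text file with three pipe-separated columns, reads the input line by line so that a file with over a million records never has to fit in memory. The names trim_around and process_file, and the choice to copy lines with fewer than three columns unchanged, are assumptions made for illustration.

```python
def trim_around(text, keyword, width=2):
    """Keep up to `width` words on each side of the first keyword hit."""
    words = text.split()
    for i, word in enumerate(words):
        # Ignore punctuation attached to the word, e.g. "lists,".
        if word.strip(",.;:!?") == keyword:
            return " ".join(words[max(i - width, 0):i + width + 1])
    return text  # keyword not found: leave the column unchanged


def process_file(src_path, dst_path, keyword):
    """Stream src_path to dst_path, rewriting only the third column."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            parts = line.split("|", 2)
            if len(parts) == 3:
                # Columns 1 and 2 are written back exactly as read.
                parts[2] = trim_around(parts[2], keyword)
            dst.write("|".join(parts) + "\n")
```

A call such as process_file("input.txt", "output.txt", "lists") would then replace the print statements; both file names here are placeholders.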
