Keyword search in only one column of a file, keeping 2 words before and after the keyword

Question

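I love Python, and I am also new to it. With help from the community (users such as Antti Haapala) I was able to make some progress, but I am stuck at the last step. Please help. I have two tasks left before I move on to my big data POC. (I plan to use this code on 1+ million records in a text file.)

• Search for a keyword in one column (C# 3) and keep the two words before and after that keyword.

• Divert the print output to a file.

• I do not want to touch C# 1 and C# 2, for referential integrity purposes.

Thanks a lot for your help.

My input file: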

C #1 C # 2  C# 3   (these are headings of columns, I used just for clarity)
12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it

Desired output file: (only column 3, the last column, changes)

12088|CITA|very nice lists, better to 
12089|CITA|theme for lists keep it

The code I am currently using:

s = """12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it """
for line in s.splitlines():  
    if not line.strip():
        continue  
    fields = line.split(None, 2)  
    joined = '|'.join(fields)
    print(joined)

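By the way, if I use a plain keyword search, it also looks through the first and second columns. My challenge is to leave the first and second columns unchanged, search only the third column, and keep 2 words before and after the keyword.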

Answer 1

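First, I need to warn you that using this code on 1 million records is risky. You are dealing with regular expressions, and this approach works well only as long as the data really is regular. Otherwise you may end up writing a large number of special cases to extract the data you want without also extracting data you do not want.

For a million cases you will need pandas, because a plain for loop is too slow.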

import pandas as pd
import re

df = pd.DataFrame({
    'C1': [12088, 12089],
    'C2': ["CITA", "CITA"],
    'C3': ["Hello very nice lists, better to keep those",
           "This is great theme for lists keep it"],
})
# Raw string so that \w and \s reach the regex engine unescaped.
pattern = r'(?<=Hello)[\w\s,]*(?=keep)|(?<=great)[\w\s,]*'
# Take the first match in each row and trim the surrounding whitespace.
df["C3"] = df["C3"].map(lambda x: re.findall(pattern, str(x))[0].strip())

This gives:

df
      C1    C2                          C3
0  12088  CITA  very nice lists, better to
1  12089  CITA     theme for lists keep it
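
The pattern above hard-codes the words next to the keyword (Hello, great), so it would need a new alternative for every new neighbor. As a rough sketch of a keyword-driven alternative, the pattern can be built from the keyword itself. The helper names `context_pattern` and `keep_context` are made up for illustration, and a "word" is assumed to be any run of non-whitespace characters. This also matches tokens that merely contain the keyword (for example "playlists"), which may or may not be what you want.

```python
import re


def context_pattern(keyword, n=2):
    """Compile a pattern capturing up to `n` tokens on each side of `keyword`.

    A token is any run of non-whitespace characters, so "lists," counts as
    the keyword token together with its trailing comma.
    """
    token = r"\S*" + re.escape(keyword) + r"\S*"
    return re.compile(r"(?:\S+\s+){0,%d}%s(?:\s+\S+){0,%d}" % (n, token, n))


def keep_context(text, pattern):
    """Return the matched window, or the text unchanged if nothing matches."""
    match = pattern.search(text)
    return match.group(0) if match else text


pattern = context_pattern("lists")
rows = [
    "{Hello very nice lists, better to keep those",
    "This is great theme for lists keep it",
    "no keyword here",
]
for row in rows:
    print(keep_context(row, pattern))
```

With pandas, the same function could be applied to the third column through `df["C3"].map(...)`, as in the code above.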

Answer 2

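There are still some open questions about how exactly your keyword search should behave. Your example already contains one obstacle: what should happen with characters such as commas? It is also unclear what to do with lines that do not contain the keyword. And what if there are fewer than two words before or after the keyword? I guess you are not sure about the exact requirements yourself and have not considered all the corner cases.

Nevertheless, I have made some "blind decisions" on these questions. Here is a naive example implementation that assumes your keyword matching rules are rather simple. I have created the function findword(), which you can adjust to whatever you need. So maybe this example helps you pin down your own requirements.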

KEYWORD = "lists"

S = """12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it """


def findword(words, keyword):
    """Return index of first occurrence of `keyword` in sequence
    `words`, otherwise return None.

    The current implementation searches for "keyword" as well as
    for "keyword," (with trailing comma).
    """
    for test in (keyword, "%s," % keyword):
        try:
            return words.index(test)
        except ValueError:
            pass
    return None


for line in S.splitlines():
    tokens = line.split("|")
    words = tokens[2].split()
    idx = findword(words, KEYWORD)
    if idx is None:
        # Keyword not found. Print line without change.
        print(line)
        continue
    # Keep up to two words on each side. Slices clamp an end index past
    # the list, so only the start needs a guard against going negative.
    start = max(idx - 2, 0)
    end = idx + 3
    tokens[2] = " ".join(words[start:end])
    print('|'.join(tokens))

Test:

$ python test.py
12088|CITA|very nice lists, better to
12089|CITA|theme for lists keep it
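
The script prints to the console, while the question also asks for the result in a file and mentions 1+ million records. A minimal sketch of one way to do that is to process the input line by line and write each result straight to an output handle, so that the whole file never has to sit in memory. The function name `trim_column` and the file names in the comment are hypothetical, and the code assumes "|" never appears inside the first two columns.

```python
import io


def trim_column(lines, keyword, n=2, sep="|"):
    """Yield each line with its last column cut to a window around `keyword`.

    Assumes the separator does not occur inside the first two columns.
    Lines without the keyword, and blank lines, pass through unchanged.
    """
    for line in lines:
        line = line.rstrip("\n")
        parts = line.split(sep, 2)  # at most 3 fields; column 3 keeps any extra "|"
        if len(parts) < 3:
            yield line
            continue
        words = parts[2].split()
        # Match the bare keyword or the keyword with a trailing comma.
        idx = next(
            (i for i, w in enumerate(words) if w in (keyword, keyword + ",")),
            None,
        )
        if idx is not None:
            parts[2] = " ".join(words[max(idx - n, 0):idx + n + 1])
        yield sep.join(parts)


# In-memory stand-ins for the real files; open("in.txt") and
# open("out.txt", "w") (hypothetical names) could be used in the same places.
source = io.StringIO(
    "12088|CITA|{Hello very nice lists, better to keep those\n"
    "12089|CITA|This is great theme for lists keep it\n"
)
target = io.StringIO()
for out_line in trim_column(source, "lists"):
    target.write(out_line + "\n")

print(target.getvalue(), end="")
```

Because `trim_column` is a generator, it only ever holds one line at a time, whatever the size of the input.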

PS: I hope I got the indices for the slicing right. You should check, though.
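
One way to check the slice indices is to try the keyword at every awkward position: first word, second word, second-to-last word, and last word. The sketch below uses a made-up helper `window` and relies on two facts about Python slices. An end index past the end of the list is clamped, whereas a negative start (or an end of -1) counts from the end of the list and would give the wrong window.

```python
words = "This is great theme for lists keep it".split()


def window(words, idx, n=2):
    """Return up to `n` words on each side of position `idx`."""
    # max() stops a negative start from wrapping around to the end of the list;
    # an end index past the last element is simply clamped by the slice.
    return words[max(idx - n, 0):idx + n + 1]


for idx in (0, 1, 5, 6, 7):
    print(idx, window(words, idx))
```

For example, position 6 ("keep") should give `['for', 'lists', 'keep', 'it']`, with the final word included.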

Solution 1
1 2015-02-08 21:04:39

Solution 2
0 2015-02-08 20:46:51
