
Python: search all lines that contain "word"

I have trouble with text parsing.

Using the Grab library I fetch an HTML page, then I convert it to text with the NLTK library and store that text in a variable. After this, I want to search for all lines that contain "word" and print those lines.

For example, we have the following text:

test1: olololo
test2: print something
FAQ it's Frequently Asked Question(s)

I want to search for test1 and print the result as: test1: olololo

import logging, nltk
from grab import Grab
from urllib import urlopen

logging.basicConfig(level=logging.DEBUG)
parsing_url = raw_input("Enter URL:")
if parsing_url.startswith('http://') or parsing_url.startswith('https://'):
    parsing_url = parsing_url.replace('http://','').replace('https://','')
print parsing_url
g = Grab()
g.go('http://user:pass@' + parsing_url, log_file='out.html')
url = "out.html"
html = urlopen(url).read()
raw = nltk.clean_html(html)

In bash I would do it like this:

root@srv:~$ cat 123 | grep "test1"

And as a result I get:

test1: olololo

But in Python I don't want to execute bash commands :)

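Try this:

# raw holds the text extracted from the HTML;
# splitlines() splits on line breaks, while split() would split on every whitespace
for line in raw.splitlines():
    if "test1" in line:
        print(line)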
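
To make the idea above reusable, here is a minimal sketch of a grep-like helper. The name `grep_lines` and the `sample` string are my own assumptions for illustration (the sample is the example text from the question); in the question's code you would pass `raw` instead.

```python
def grep_lines(text, word):
    """Return every line of `text` that contains `word` (hypothetical helper)."""
    return [line for line in text.splitlines() if word in line]


# Sample input: the example text from the question.
sample = (
    "test1: olololo\n"
    "test2: print something\n"
    "FAQ it's Frequently Asked Question(s)\n"
)

for line in grep_lines(sample, "test1"):
    print(line)  # prints: test1: olololo
```

This is a plain substring test, like `grep` with a fixed string; if you need patterns, `re.search` from the standard library could replace the `in` check.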

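Assuming raw is a list of strings (i.e. a list of lines; if raw is a single string, split it with raw.splitlines() first):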

good_lines = [l for l in raw if 'test1' in l]
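
If the text already lives in a file, as in the `cat 123 | grep "test1"` example, the same filter can probably be applied while reading the file line by line. This is only a sketch: `grep_file` is a hypothetical helper name, and the temporary file below just stands in for your real file.

```python
import os
import tempfile


def grep_file(path, word):
    """Yield lines of the file at `path` that contain `word`, without the trailing newline."""
    with open(path) as handle:
        for line in handle:  # a file object yields one line at a time
            if word in line:
                yield line.rstrip("\n")


# Demo only: write sample text to a temporary file, then search it.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("test1: olololo\ntest2: print something\n")
    sample_path = tmp.name

matches = list(grep_file(sample_path, "test1"))
os.remove(sample_path)
print(matches)  # prints: ['test1: olololo']
```

Because it is a generator, the whole file never has to be held in memory, which may matter for large logs.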

Maybe someone will find this useful. I solved the problem like this:
1. Decode the HTML to text using the NLTK library.
2. Write this text to a file.
3. Parse the file with a bash command.

For example:

import commands  # Python 2 only
status,host = commands.getstatusoutput("cat raw.log | sed 's/^[ \t]*//' | grep -A 2 \"On Host\" | sed -n 2p")
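
For what it's worth, that pipeline can likely be done in pure Python too. As far as I can tell, it strips leading spaces and tabs, then prints the second line of the `grep -A 2` output, which is the line right after the first line containing "On Host". A rough sketch, where the function name `line_after_first_match` and the `sample_log` contents are made up for illustration:

```python
def line_after_first_match(text, marker):
    """Return the line following the first line that contains `marker`.

    Leading spaces and tabs are stripped from every line, like the sed
    expression does. Returns None if there is no match or no following line.
    """
    lines = [line.lstrip(" \t") for line in text.splitlines()]
    for index, line in enumerate(lines):
        if marker in line:
            if index + 1 < len(lines):
                return lines[index + 1]
            return None
    return None


# Hypothetical log content, only to show the behaviour.
sample_log = "  On Host\n\texample-host\n  other line\n"
host = line_after_first_match(sample_log, "On Host")
print(host)  # prints: example-host
```

With the real data you would read `raw.log` (or use `raw` directly) instead of `sample_log`.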

I'm also trying to parse this text using Python's own tools.


 