Print specific lines from a text file in Python

I have a text file, and I'm trying to print the lines that don't start with a number, but the output is not readable.

This is part of what my code returns:


{\fonttbl\f0\fswiss\fcharset0 Helvetica;}








\f0\fs26 \cf2 \cb3 Since the start of digital video in 1988, new video formats are developed every year\cf4 \cb5 \


\cf6 \cb3 00:14\cb5 \

And this is my code:

numbers = ("0", "1", "2", "3", "4", "5", "6", "7", "8", "9")

aFile = open("/Users/maira/Desktop/text.rtf")

lines = aFile.readlines()

for line in lines:
    if not line.startswith(numbers):
        print(line)

This is an example of the original text:

Since the start of digital video in 1988, new video formats are developed every year
in an attempt to provide improvements in quality, file size and video playback.
The popularity of video continues to grow rapidly, with 78% of people watching at least
one digital video on one of their devices every single day; However video formats and
how they work is still a subject of much confusion for most people.

I've seen some questions similar to mine, but I can't get to a solution.

I'd appreciate any advice, and if there's also a way of deleting the blank lines in between lines, I'd be very thankful.

Thank you.

I used a very complete function provided in this answer to strip all the RTF markup, and after that I use a regex to remove the timestamps in HH:MM format. Maybe this will help you.

import re

def striprtf(text):
    pattern = re.compile(r"\\([a-z]{1,32})(-?\d{1,10})?[ ]?|\\'([0-9a-f]{2})|\\([^a-z])|([{}])|[\r\n]+|(.)", re.I)
    # Control words which specify a "destination"; everything inside such a group is skipped.
    # Abbreviated here: the list in the original function is much longer and is not reproduced.
    destinations = frozenset((
        'colortbl', 'fonttbl', 'info', 'pict', 'stylesheet',
    ))
    # Translation of some special characters.
    specialchars = {
        'par': '\n',
        'sect': '\n\n',
        'page': '\n\n',
        'line': '\n',
        'tab': '\t',
        'emdash': u'\u2014',
        'endash': u'\u2013',
        'emspace': u'\u2003',
        'enspace': u'\u2002',
        'qmspace': u'\u2005',
        'bullet': u'\u2022',
        'lquote': u'\u2018',
        'rquote': u'\u2019',
        'ldblquote': u'\u201C',
        'rdblquote': u'\u201D',
    }
    stack = []
    ignorable = False       # Whether this group (and all inside it) are "ignorable".
    ucskip = 1              # Number of ASCII characters to skip after a unicode character.
    curskip = 0             # Number of ASCII characters left to skip
    out = []                # Output buffer.
    for match in pattern.finditer(text):
        word,arg,hex,char,brace,tchar = match.groups()
        if brace:
            curskip = 0
            if brace == '{':
                # Push state
                stack.append((ucskip,ignorable))
            elif brace == '}':
                # Pop state
                ucskip,ignorable = stack.pop()
        elif char: # \x (not a letter)
            curskip = 0
            if char == '~':
                if not ignorable:
                    out.append(u'\xA0')   # non-breaking space
            elif char in '{}\\':
                if not ignorable:
                    out.append(char)      # escaped literal brace or backslash
            elif char == '*':
                ignorable = True
        elif word: # \foo
            curskip = 0
            if word in destinations:
                ignorable = True
            elif ignorable:
                pass
            elif word in specialchars:
                out.append(specialchars[word])
            elif word == 'uc':
                ucskip = int(arg)
            elif word == 'u':
                c = int(arg)
                if c < 0: c += 0x10000
                out.append(chr(c))        # Python 3: chr() covers all code points (was unichr)
                curskip = ucskip
        elif hex: # \'xx
            if curskip > 0:
                curskip -= 1
            elif not ignorable:
                c = int(hex,16)
                out.append(chr(c))
        elif tchar:
            if curskip > 0:
                curskip -= 1
            elif not ignorable:
                out.append(tchar)
    return ''.join(out)

with open('/Users/maira/Desktop/text.rtf', 'r') as file:
    rtf = file.read()
    text = striprtf(rtf)
    # Remove HH:MM timestamps such as 00:14.
    text = re.sub('(0[0-9]|1[0-9]|2[0-3]):[0-5][0-9]', '', text)
    print(text)
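
As for the blank lines asked about in the question: once the RTF markup is gone, they can probably be dropped in the same pass that skips lines starting with a digit. This is only a minimal sketch, assuming the text is already plain text; the helper name `keep_text_lines` and the `sample` string are made up for illustration and are not part of the function above.

```python
def keep_text_lines(text):
    """Return the non-blank lines of `text` that do not start with a digit."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and lines such as "00:14" whose first character is a digit.
        if stripped and not stripped[0].isdigit():
            kept.append(stripped)
    return kept


# Hypothetical sample standing in for already-stripped text.
sample = (
    "Since the start of digital video in 1988\n"
    "\n"
    "00:14\n"
    "\n"
    "in an attempt to provide improvements\n"
)

for line in keep_text_lines(sample):
    print(line)
```

Filtering on whole lines like this would also make the HH:MM regex unnecessary when every timestamp sits on its own line, which is what the sample output in the question suggests.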
