Print specific lines from a text file in python

I have a text file from which I'm trying to print the lines that don't start with a number, but the output is not readable.

This is part of what my code returns:

{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600

{\fonttbl\f0\fswiss\fcharset0 Helvetica;}

{\colortbl;\red255\green255\blue255;\red4\green4\blue4;\red247\green247\blue247;\red0\green0\blue0;

\red255\green255\blue255;\red77\green77\blue77;}

{\*\expandedcolortbl;;\cssrgb\c1176\c1176\c1176;\cssrgb\c97647\c97647\c97647;\cssrgb\c0\c0\c0;

\cssrgb\c100000\c100000\c100000;\cssrgb\c37647\c37647\c37647;}

\paperw11900\paperh16840\margl1440\margr1440\vieww10800\viewh8400\viewkind0

\deftab720

\pard\pardeftab720\partightenfactor0



\f0\fs26 \cf2 \cb3 Since the start of digital video in 1988, new video formats are developed every year\cf4 \cb5 \

\pard\pardeftab720\partightenfactor0

\cf6 \cb3 00:14\cb5 \

And this is my code:


numbers = ("0", "1", "2", "3", "4", "5", "6", "7", "8", "9")

aFile = open("/Users/maira/Desktop/text.rtf")

lines = aFile.readlines()

for line in lines:
    if not line.startswith((numbers)):
        print(line)
    
aFile.close()

This is an example of the original text:

Since the start of digital video in 1988, new video formats are developed every year
00:14
in an attempt to provide improvements in quality, file size and video playback.
00:18
The popularity of video continues to grow rapidly, with 78% of people watching at least
00:24
one digital video on one of their devices every single day; However video formats and
00:29
how they work is still a subject of much confusion for most people.

I've seen some questions similar to mine, but I can't get to a solution.

I'd appreciate any advice, and if there's also a way of deleting the blank lines in between lines, I'd be very thankful.

Thank you.

Your file is RTF rather than plain text, so reading it directly returns the RTF markup along with the text. I used a very complete function provided in this answer to strip all the RTF markup, and after that I use a regex to remove the HH:MM timestamps. Maybe this will help you.

import re

def striprtf(text):
    pattern = re.compile(r"\\([a-z]{1,32})(-?\d{1,10})?[ ]?|\\'([0-9a-f]{2})|\\([^a-z])|([{}])|[\r\n]+|(.)", re.I)
    # control words which specify a "destination".
    destinations = frozenset((
        'aftncn','aftnsep','aftnsepc','annotation','atnauthor','atndate','atnicn','atnid',
        'atnparent','atnref','atntime','atrfend','atrfstart','author','background',
        'bkmkend','bkmkstart','blipuid','buptim','category','colorschememapping',
        'colortbl','comment','company','creatim','datafield','datastore','defchp','defpap',
        'do','doccomm','docvar','dptxbxtext','ebcend','ebcstart','factoidname','falt',
        'fchars','ffdeftext','ffentrymcr','ffexitmcr','ffformat','ffhelptext','ffl',
        'ffname','ffstattext','field','file','filetbl','fldinst','fldrslt','fldtype',
        'fname','fontemb','fontfile','fonttbl','footer','footerf','footerl','footerr',
        'footnote','formfield','ftncn','ftnsep','ftnsepc','g','generator','gridtbl',
        'header','headerf','headerl','headerr','hl','hlfr','hlinkbase','hlloc','hlsrc',
        'hsv','htmltag','info','keycode','keywords','latentstyles','lchars','levelnumbers',
        'leveltext','lfolevel','linkval','list','listlevel','listname','listoverride',
        'listoverridetable','listpicture','liststylename','listtable','listtext',
        'lsdlockedexcept','macc','maccPr','mailmerge','maln','malnScr','manager','margPr',
        'mbar','mbarPr','mbaseJc','mbegChr','mborderBox','mborderBoxPr','mbox','mboxPr',
        'mchr','mcount','mctrlPr','md','mdeg','mdegHide','mden','mdiff','mdPr','me',
        'mendChr','meqArr','meqArrPr','mf','mfName','mfPr','mfunc','mfuncPr','mgroupChr',
        'mgroupChrPr','mgrow','mhideBot','mhideLeft','mhideRight','mhideTop','mhtmltag',
        'mlim','mlimloc','mlimlow','mlimlowPr','mlimupp','mlimuppPr','mm','mmaddfieldname',
        'mmath','mmathPict','mmathPr','mmaxdist','mmc','mmcJc','mmconnectstr',
        'mmconnectstrdata','mmcPr','mmcs','mmdatasource','mmheadersource','mmmailsubject',
        'mmodso','mmodsofilter','mmodsofldmpdata','mmodsomappedname','mmodsoname',
        'mmodsorecipdata','mmodsosort','mmodsosrc','mmodsotable','mmodsoudl',
        'mmodsoudldata','mmodsouniquetag','mmPr','mmquery','mmr','mnary','mnaryPr',
        'mnoBreak','mnum','mobjDist','moMath','moMathPara','moMathParaPr','mopEmu',
        'mphant','mphantPr','mplcHide','mpos','mr','mrad','mradPr','mrPr','msepChr',
        'mshow','mshp','msPre','msPrePr','msSub','msSubPr','msSubSup','msSubSupPr','msSup',
        'msSupPr','mstrikeBLTR','mstrikeH','mstrikeTLBR','mstrikeV','msub','msubHide',
        'msup','msupHide','mtransp','mtype','mvertJc','mvfmf','mvfml','mvtof','mvtol',
        'mzeroAsc','mzeroDesc','mzeroWid','nesttableprops','nextfile','nonesttables',
        'objalias','objclass','objdata','object','objname','objsect','objtime','oldcprops',
        'oldpprops','oldsprops','oldtprops','oleclsid','operator','panose','password',
        'passwordhash','pgp','pgptbl','picprop','pict','pn','pnseclvl','pntext','pntxta',
        'pntxtb','printim','private','propname','protend','protstart','protusertbl','pxe',
        'result','revtbl','revtim','rsidtbl','rxe','shp','shpgrp','shpinst',
        'shppict','shprslt','shptxt','sn','sp','staticval','stylesheet','subject','sv',
        'svb','tc','template','themedata','title','txe','ud','upr','userprops',
        'wgrffmtfilter','windowcaption','writereservation','writereservhash','xe','xform',
        'xmlattrname','xmlattrvalue','xmlclose','xmlname','xmlnstbl',
        'xmlopen',
    ))
    # Translation of some special characters.
    specialchars = {
        'par': '\n',
        'sect': '\n\n',
        'page': '\n\n',
        'line': '\n',
        'tab': '\t',
        'emdash': u'\u2014',
        'endash': u'\u2013',
        'emspace': u'\u2003',
        'enspace': u'\u2002',
        'qmspace': u'\u2005',
        'bullet': u'\u2022',
        'lquote': u'\u2018',
        'rquote': u'\u2019',
        'ldblquote': u'\u201C',
        'rdblquote': u'\u201D',
    }
    stack = []
    ignorable = False       # Whether this group (and all inside it) are "ignorable".
    ucskip = 1              # Number of ASCII characters to skip after a unicode character.
    curskip = 0             # Number of ASCII characters left to skip
    out = []                # Output buffer.
    for match in pattern.finditer(text):
        word,arg,hex,char,brace,tchar = match.groups()
        if brace:
            curskip = 0
            if brace == '{':
                # Push state
                stack.append((ucskip,ignorable))
            elif brace == '}':
                # Pop state
                ucskip,ignorable = stack.pop()
        elif char: # \x (not a letter)
            curskip = 0
            if char == '~':
                if not ignorable:
                    out.append(u'\xA0')
            elif char in '{}\\':
                if not ignorable:
                    out.append(char)
            elif char == '*':
                ignorable = True
        elif word: # \foo
            curskip = 0
            if word in destinations:
                ignorable = True
            elif ignorable:
                pass
            elif word in specialchars:
                out.append(specialchars[word])
            elif word == 'uc':
                ucskip = int(arg)
            elif word == 'u':
                # \uN carries a signed 16-bit value, so negatives wrap around.
                c = int(arg)
                if c < 0: c += 0x10000
                # Python 3: chr() handles every code point (unichr is Python 2 only).
                out.append(chr(c))
                curskip = ucskip
        elif hex: # \'xx
            if curskip > 0:
                curskip -= 1
            elif not ignorable:
                c = int(hex,16)
                out.append(chr(c))
        elif tchar:
            if curskip > 0:
                curskip -= 1
            elif not ignorable:
                out.append(tchar)
    return ''.join(out)

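# The with block closes the file automatically, so no explicit close() is needed.
with open('/Users/maira/Desktop/text.rtf', 'r') as file:
    rtf = file.read()

text = striprtf(rtf)
# Remove HH:MM timestamps (00:00 to 23:59).
text = re.sub('(0[0-9]|1[0-9]|2[0-3]):[0-5][0-9]', '', text)
print(text)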
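
The regex above removes the timestamps but leaves the now-empty lines behind. If you also want to drop the blank lines, one option is to filter the stripped text line by line. This is only a minimal sketch: `keep_text_lines` and `TIMESTAMP` are names I made up for illustration, and it assumes each timestamp sits alone on its own line, as in your sample.

```python
import re

# Hypothetical helper, not part of the striprtf function above.
# Matches a line that consists only of a timestamp such as 00:14.
TIMESTAMP = re.compile(r"\d{1,2}:\d{2}")

def keep_text_lines(text):
    """Return the lines that are neither blank nor a bare timestamp."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if TIMESTAMP.fullmatch(stripped):
            continue  # skip lines that are only a timestamp
        kept.append(stripped)
    return kept

# Sample text taken from the question; in practice pass striprtf(rtf) instead.
sample = (
    "Since the start of digital video in 1988, new video formats are developed every year\n"
    "00:14\n"
    "\n"
    "in an attempt to provide improvements in quality, file size and video playback.\n"
    "00:18\n"
)

for line in keep_text_lines(sample):
    print(line)
```

Matching the whole line against a timestamp pattern, rather than testing whether the line starts with a digit, keeps any text line that happens to begin with a number.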
