Removing page numbers from a .txt file in Python
I am trying to load a .txt file of an ebook and remove lines that contain page numbers. The book looks like:
2
Words
More words.
More words.
3
More words.
Here is what I have so far:
x = 1
with open("first.txt","r") as input:
    with open("last.txt","wb") as output:
        for line in input:
            if line != str(x) + "\n":
                output.write(line + "\n")
            x + x + 1
My output file comes out with all of the white space (new lines) removed (which I don't want) and it does not even remove the numbers. Does anyone have any ideas? Thanks!
1) You don't have to open the output file in binary mode:
open("last.txt","wb")
-> open("last.txt","w")
2) x + x + 1 only evaluates an expression and discards the result, so x never changes:
x + x + 1
-> x += 1
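With both fixes in place, the counter idea from the question can be made to work, provided the counter advances only when a page number is actually found. A minimal sketch, assuming page numbers sit alone on a line and increase by one each time; the function name `drop_sequential_page_numbers` and the `first_page` parameter are hypothetical, chosen for illustration (the sample in the question starts at page 2):

```python
def drop_sequential_page_numbers(lines, first_page=2):
    """Yield every line except the ones that match the expected page number."""
    expected = first_page  # assumption: the first page number in the file
    for line in lines:
        if line.strip() == str(expected):
            expected += 1  # advance only when a page number was found
            continue       # skip the page-number line
        yield line         # keep the line, including its original newline


sample = ["2\n", "Words\n", "More words.\n", "More words.\n", "3\n", "More words.\n"]
cleaned = list(drop_sequential_page_numbers(sample))
print("".join(cleaned), end="")
```

Compared with removing every numeric line, this keeps a line such as "1984" that happens to be part of the text, at the cost of breaking if a page number is missing from the sequence.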
But you could do it far more simply:
with open("first.txt","r") as input:
    with open("last.txt","w") as output:
        for line in input:
            line = line.strip()  # clear white space
            try:
                int(line)  # is this a number?
            except ValueError:
                output.write(line + "\n")
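One side effect worth knowing: because each line is stripped before it is written, indentation and trailing spaces are lost along with the page numbers, while blank lines survive. A small sketch of the same loop applied to in-memory lines (the helper name `strip_and_filter` is hypothetical):

```python
def strip_and_filter(lines):
    """Same logic as the loop above, applied to a list of lines."""
    kept = []
    for line in lines:
        line = line.strip()           # removes the newline and any indentation
        try:
            int(line)                 # succeeds for page-number lines
        except ValueError:
            kept.append(line + "\n")  # not a number: keep it
    return kept


result = strip_and_filter(["2\n", "    Indented words\n", "\n", "3\n"])
print(result)
```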
Check if you can convert the line to an integer and skip this line if that succeeds. Not the quickest solution, but should work.
for line in input:  # `input` and `output` are the open files from the question
    try:
        int(line)
        # skip storing that line
        continue
    except ValueError:
        # save the line to output
        output.write(line)
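Note that `int()` is more permissive than "digits only": it ignores surrounding whitespace and accepts a sign, so a line such as `-7` would be dropped too. A quick sketch for checking which lines this test would skip (the helper `looks_like_int` is a hypothetical name):

```python
def looks_like_int(line):
    """Return True if int() can parse the line, i.e. the line would be skipped."""
    try:
        int(line)
        return True
    except ValueError:
        return False


for candidate in ["3\n", "  12  \n", "-7\n", "3.5\n", "Chapter 3\n", "\n"]:
    print(repr(candidate), looks_like_int(candidate))
```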
Use regular expressions to ignore lines that contain just a number.
import sys
import re

# raw string so that \d is not treated as a string escape
pattern = re.compile(r"^\d+$")
for line in sys.stdin:
    if not pattern.match(line):
        sys.stdout.write(line)
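The script above reads from standard input and writes to standard output, so it would typically be run with shell redirection, for example `python strip_numbers.py < first.txt > last.txt` (the script name is hypothetical). For experimenting without files, the same pattern can be wrapped in a function; this is only a sketch, and `remove_number_lines` is an illustrative name. Since `$` also matches just before a trailing newline, a line like "3\n" is recognised without stripping it first.

```python
import re

# Matches a line made up only of digits; `$` also matches before a final "\n".
PAGE_NUMBER = re.compile(r"^\d+$")


def remove_number_lines(lines):
    """Return the lines that do not consist solely of digits."""
    return [line for line in lines if not PAGE_NUMBER.match(line)]


book = ["2\n", "Words\n", "\n", "3\n", "Page 4\n", " 5 \n"]
filtered = remove_number_lines(book)
print(filtered)
```

Unlike the `int()` test, this pattern keeps a number that is surrounded by spaces, so whether it fits depends on how the page numbers are laid out in the file.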
Improved solution: one less indentation level, no unnecessary strip or string concatenation, and an explicit exception is caught.
with open("first.txt","r") as input_file, open("last.txt","w") as output_file:
    for line in input_file:
        try:
            int(line)
        except ValueError:
            output_file.write(line)
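To try this without touching real files, the same loop can be pointed at a temporary directory. The following is a sketch under the assumption that the book resembles the sample in the question, with one blank line added to show that blank lines are kept:

```python
import os
import tempfile

book = "2\nWords\nMore words.\n\nMore words.\n3\nMore words.\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "first.txt")
    dst = os.path.join(tmp, "last.txt")
    with open(src, "w") as f:
        f.write(book)

    # Same loop as above, pointed at the temporary files.
    with open(src, "r") as input_file, open(dst, "w") as output_file:
        for line in input_file:
            try:
                int(line)  # int() ignores the trailing newline
            except ValueError:
                output_file.write(line)  # the line still carries its own newline

    with open(dst, "r") as f:
        cleaned = f.read()

print(cleaned, end="")
```

Because each line is written back unchanged, the original line breaks and indentation are preserved; only the lines that `int()` can parse disappear.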