Reading a .txt file having lines starting with special characters
I am trying to read every line of a file in which each line starts with a different special-character prefix. I don't want to read those special characters, just the rest of the line that follows. The records are presented in the file like this: (image not available). Then I need to store each document in a row of a DataFrame.
This is what I tried:
Doc = namedtuple('Doc', 'venue year authors title id references abstract')
docs = []
with open('/content/test.txt') as f:
    for l in f.readlines():
        ln = l.rstrip('\n')
        if ln.startswith('#c'):
            venue = ln[2:]
            #print(venue)
        if ln.startswith('#t'):
            year = ln[2:]
            #print(year)
        if ln.startswith('#@'):
            authors = []
            author = ln[2:]
            authors.append(author)
        if ln.startswith('#*'):
            title = ln[2:]
        if ln.startswith('#index'):
            id = ln[2:]
        if ln.startswith('#%'):
            references = []
            reference = ln[2:]
            references.append(reference)
        if ln.startswith('#!'):
            abstract = ln[2:]
            print(abstract)
            docs.append(Doc(venue, year, authors, title, id, references, abstract))
df = pd.DataFrame.from_records(docs, columns=
    ['Venue','Year','Authors','Title','id', 'ListCitations','Abstract'])
df
Can someone help me solve the problem of storing multi-valued fields such as authors and references in a list? Thank you.
I fixed the code. I guess you were missing some imports, so I added them. Before running the code, you may also need to install the pandas package.
I am currently using Python 3.8; if you are using Python 2, you may want to drop the 3 from the pip command.
from collections import namedtuple
import pandas as pd

Doc = namedtuple('Doc', 'venue year authors title id references abstract')
docs = []
with open("content/test.txt", "r") as f:
    while True:
        # Read a line.
        ln = f.readline()
        print('{} : {}'.format("LINE : ", ln))
        # When readline returns an empty string, the file is fully read.
        if ln == "":
            print("::DONE::")
            break
        ln = ln.rstrip('\n')
        # After stripping the newline, an empty string means a blank line.
        if ln == "":
            continue
        if ln.startswith('#c'):
            venue = ln[2:]
            print(venue)
        if ln.startswith('#t'):
            year = ln[2:]
            print(year)
        if ln.startswith('#@'):
            authors = []
            # Collect the '#@' line and the unprefixed author lines that
            # follow, up to (but not including) the '#t' line.
            while ln and not ln.startswith('#t'):
                if ln.startswith('#@'):
                    author = ln[2:]
                else:
                    author = ln[0:]
                authors.append(author)
                ln = f.readline()
                print(author)
            # Step back so the main loop reads the '#t' line again
            # (assumes single-byte characters and '\n' line endings).
            f.seek(f.tell() - len(ln))
        if ln.startswith('#*'):
            title = ln[2:]
            print(title)
        if ln.startswith('#index'):
            id = ln[6:]
            print(id)
        if ln.startswith('#%'):
            references = []
            # Collect consecutive '#%' lines up to the '#!' line.
            while ln and not ln.startswith('#!'):
                reference = ln[2:]
                references.append(reference)
                ln = f.readline()
                print(reference)
        if ln.startswith('#!'):
            # The abstract is the last field of a record, so store the record here.
            abstract = ln[2:]
            print(abstract)
            docs.append(Doc(venue, year, authors, title, id, references, abstract))
df = pd.DataFrame.from_records(
    docs, columns=['Venue', 'Year', 'Authors', 'Title', 'id', 'ListCitations', 'Abstract'])
pd.set_option("display.max_rows", None, "display.max_columns", None)
print(df)
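As an alternative to rewinding the file with seek, the records can be grouped in a single pass. The following is only a sketch under stated assumptions, not the answer's method: it assumes every record starts with a `#*` line (as in the test input), that unprefixed lines after `#@` are further authors, and that unprefixed lines after `#!` continue the abstract. The names `parse_docs` and `SCALAR_PREFIXES` are hypothetical.

```python
from collections import namedtuple

Doc = namedtuple('Doc', 'venue year authors title id references abstract')

# Single-valued fields: line prefix -> Doc field name.
SCALAR_PREFIXES = {'#index': 'id', '#c': 'venue', '#t': 'year',
                   '#*': 'title', '#!': 'abstract'}


def parse_docs(lines):
    """Group prefixed lines into Doc records; a '#*' line starts a new record."""
    docs, rec, last = [], None, None
    for raw in lines:
        ln = raw.strip()
        if not ln:
            continue  # skip blank lines
        if ln.startswith('#*'):
            if rec is not None:
                docs.append(Doc(**rec))  # close the previous record
            rec = dict(venue='', year='', authors=[], title='',
                       id='', references=[], abstract='')
        if rec is None:
            continue  # ignore anything before the first '#*' line
        if ln.startswith('#@'):
            rec['authors'].append(ln[2:])
            last = 'authors'
        elif ln.startswith('#%'):
            rec['references'].append(ln[2:])
            last = 'references'
        elif ln.startswith('#'):
            for prefix, field in SCALAR_PREFIXES.items():
                if ln.startswith(prefix):
                    rec[field] = ln[len(prefix):]
                    last = field
                    break
        elif last == 'authors':
            rec['authors'].append(ln)      # unprefixed author line
        elif last == 'abstract':
            rec['abstract'] += ' ' + ln    # abstract continued on a new line
    if rec is not None:
        docs.append(Doc(**rec))            # close the final record
    return docs
```

The returned list can be passed to `pd.DataFrame.from_records(...)` with the same `columns` list as in the code above, for example `parse_docs(f)` where `f` is the open file.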
Test input:
#*Information geometry of U-Boost and Bregman divergence
#@Nobotu Murata,
Takenouchi
Takafumi Kanamori
#t2004
#cNeural Computation
#index436405
#%94584
#%282290
#%605546
#%620759
#%564878
#!We aim at an extension of AdaBoost to U-Boost
#*Paper 2
#@Tareq
Shareq
Sameena
#t2016
#cSimulation Computation
#index436406
#%94584
#%282291
#%605543
#%620754
#%323232232232323
#!We aim to conquere the world
Output:
                    Venue  Year  \
0      Neural Computation  2004
1  Simulation Computation  2016

                                             Authors  \
0  [Nobotu Murata,, Takenouchi\\n, Takafumi Kanamo...
1                    [Tareq , Shareq\\n, Sameena\\n]

                                               Title      id  \
0  Information geometry of U-Boost and Bregman di...  436405
1                                            Paper 2  436406

                                       ListCitations  \
0  [94584, 282290\\n, 605546\\n, 620759\\n, 564878\\n]
1  [94584, 282291\\n, 605543\\n, 620754\\n, 32323223...

                                           Abstract
0  We aim at an extension of AdaBoost to U-Boost\\n
1                   We aim to conquere the world\\n
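The list cells in the output still carry trailing newlines, and the first author keeps a trailing comma. If that is unwanted, the items can be cleaned after parsing. This is a minimal sketch; `clean_items` is a hypothetical helper, and treating the trailing comma as noise is an assumption about the data.

```python
def clean_items(items):
    """Strip surrounding whitespace and one layer of trailing commas from each item."""
    return [item.strip().rstrip(',').strip() for item in items]


# Values as they appear in the first row of the Authors column.
authors = ['Nobotu Murata,', 'Takenouchi\n', 'Takafumi Kanamori\n']
print(clean_items(authors))
```

On the DataFrame, the same helper could be applied per column, for example `df['Authors'] = df['Authors'].apply(clean_items)`.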
The code below works fine for authors, but references still stores only one element:
Doc = namedtuple('Doc', 'venue year authors nba title id references abstract')
docs = []
with open('/content/test.txt') as f:
    for l in f.readlines():
        ln = l.rstrip('\n')
        if ln.startswith('#c'):
            venue = ln[2:]
            #print(venue)
        if ln.startswith('#t'):
            year = ln[2:]
            #print(year)
        if ln.startswith('#@'):
            authors = []
            author = ln[2:]
            authors.append(author)
        else:
            if not ln.startswith('#') and not ln.startswith(' ') :
                author = ln[0:]
                authors.append(author)
        #print(authors)
        nba = len(authors)
        if ln.startswith('#*'):
            title = ln[2:]
            #print(title)
        if ln.startswith('#!'):
            abstract = ln[2:]
        else :
            if ln.startswith(' '):
                abstract += ln[1:]
        #print(abstract)
        if ln.startswith('#%'):
            references = []
            reference = ln[2:]
            references.append(reference)
        else:
            if ln.startswith('#%'):
                reference = ln[2:]
                references.append(reference)
        #print(references)
        if ln.startswith('#index'):
            id = ln[6:]
            docs.append(Doc(venue, year, authors, nba, title, id, references, abstract))
df = pd.DataFrame.from_records(docs, columns=['Venue','Year','Authors','nba', 'Title','id', 'ListCitations','Abstract'])
df
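A likely reason references ends up with one element: the `#%` branch creates a new list on every `#%` line, and the `else` branch tests the same `#%` prefix, so it can never run. One possible fix is to create the list once per record and only append afterwards. The sketch below assumes a `#*` line marks the start of each record, as in the test input; `collect_references` is a hypothetical helper that isolates just this part of the parsing.

```python
def collect_references(lines):
    """Return one list of reference ids per record.

    Assumes a '#*' (title) line marks the start of each record.
    """
    all_refs = []
    for raw in lines:
        ln = raw.rstrip('\n')
        if ln.startswith('#*'):
            all_refs.append([])          # new record: start one fresh list
        elif ln.startswith('#%') and all_refs:
            all_refs[-1].append(ln[2:])  # append; do not re-create the list
    return all_refs


sample = ['#*Paper 2\n', '#%94584\n', '#%282291\n', '#!We aim to conquere the world\n']
print(collect_references(sample))
```

In the loop above, the same idea would mean resetting `references = []` where a new record begins rather than inside the `#%` branch.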