Python text extraction
I'm working on text extraction with Python. The output is not what I want.
I have a text file containing information like this:
FN Clarivate Analytics Web of Science
VR 1.0
PT J
AU Chen, G
   Gully, SM
   Whiteman, JA
   Kilcullen, RN
AF Chen, G
   Gully, SM
   Whiteman, JA
   Kilcullen, RN
TI Examination of relationships among trait-like individual differences,
   state-like individual differences, and learning performance
SO JOURNAL OF APPLIED PSYCHOLOGY
CT 13th Annual Conference of the
   Society-for-Industrial-and-Organizational-Psychology
CY APR 24-26, 1998
CL DALLAS, TEXAS
SP Soc Ind & Org Psychol
RI Gully, Stanley/D-1302-2012
OI Gully, Stanley/0000-0003-4037-3883
SN 0021-9010
PD DEC
PY 2000
VL 85
IS 6
BP 835
EP 847
DI 10.1037//0021-9010.85.6.835
UT WOS:000165745400001
PM 11125649
ER
When I run my code like this:
import random
import sys

filepath = "data\jap_2000-2001-plain.txt"

with open(filepath) as f:
    articles = f.read().strip().split("\n")

articles_list = []
author = ""
title = ""
year = ""
doi = ""

for article in articles:
    if "AU" in article:
        author = article.split("#")[-1]
    if "TI" in article:
        title = article.split("#")[-1]
    if "PY" in article:
        year = article.split("#")[-1]
    if "DI" in article:
        doi = article.split("#")[-1]
    if article == "ER#":
        articles_list.append("{}, {}, {}, https://doi.org/{}".format(author, title, year, doi))

print("Oh hello sir, how many articles do you like to get?")
amount = input()
random_articles = random.sample(articles_list, k = int(amount))

for i in random_articles:
    print(i)
    print("\n")

exit = input('Please enter exit to exit: \n')
if exit in ['exit','Exit']:
    print("Goodbye sir!")
    sys.exit()
The extraction does not include data that continues after a line break. If I run this code, the output looks like "AU Chen, G" and does not include the other names; the same happens with the title, and so on.
My output looks like:
Chen, G. Examination of relationships among trait, 2000, doi.dx.10.1037//0021-9010.85.6.835
The desired output should be:
Chen, G., Gully, SM., Whiteman, JA., Kilcullen, RN., 2000, Examination of relationships among trait-like individual differences, state-like individual differences, and learning performance, doi.dx.10.1037//0021-9010.85.6.835
but the extraction only includes the first line of each field.
Any suggestions?
You need to track which section you are in as you parse the file. There are cleaner ways to write the state machine, but as a quick and simple example, you could do something like the code below.
Basically, add all the lines for each section to a list for that section, then combine the lists and do whatever you need at the end. Note that I didn't test this; it is just pseudo-code to show you the general idea.
authors = []
title = []
section = None

for line in articles:
    line = line.strip()
    # Check for start of new section, select the right list to add to
    if line.startswith("AU"):
        line = line[3:]
        section = authors
    elif line.startswith("TI"):
        line = line[3:]
        section = title
    # Other sections..
    ...
    # Add line to the current section
    if line and section is not None:
        section.append(line)

# Combine the collected lines once the whole file has been read
authors_str = ', '.join(authors)
title_str = ' '.join(title)
print(authors_str, title_str)
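To make the idea concrete, here is a minimal runnable sketch that extends it to whole records. It rests on assumptions that are not stated in the question: every field starts with a two-letter tag in the first two columns, continuation lines begin with whitespace, and a line starting with `ER` closes a record. The names `parse_records` and `SAMPLE` are made up for this illustration.

```python
SAMPLE = """\
FN Clarivate Analytics Web of Science
VR 1.0
PT J
AU Chen, G
   Gully, SM
   Whiteman, JA
   Kilcullen, RN
TI Examination of relationships among trait-like individual differences,
   state-like individual differences, and learning performance
PY 2000
DI 10.1037//0021-9010.85.6.835
ER
"""


def parse_records(text):
    """Return a list of dicts mapping each two-letter tag to its lines.

    Assumption: continuation lines start with whitespace and 'ER' ends a record.
    """
    records = []
    record = {}
    section = None  # list collecting the lines of the current tag
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        if raw[0].isspace():
            # Continuation line: it belongs to the most recent tag.
            if section is not None:
                section.append(raw.strip())
            continue
        tag, value = raw[:2], raw[3:].strip()
        if tag == "ER":
            # End of record: store it and start a fresh one.
            records.append(record)
            record, section = {}, None
            continue
        section = record.setdefault(tag, [])
        if value:
            section.append(value)
    return records


records = parse_records(SAMPLE)
first = records[0]
authors_str = ", ".join(first["AU"])
title_str = " ".join(first["TI"])
print(authors_str)
print(title_str)
```

Unlike the snippet above, this version does not strip a line before checking whether it is a continuation, because the leading whitespace is the only thing that separates a continuation line from a new tag.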
Initial Understanding
Based on your example, I believe that:
- The artificial section delimited by the ER tag marks the end-of-record.
- The ER section contains no usable text.
It may also be the case that:
- Records begin with the FN tag.
- Any text encountered outside of an FN / ER pair can be ignored.
Suggested Design
If this is true, I recommend you write a text processor using that logic:
- Use a state machine whose initial state is ER.
- Ignore text in the ER state until an FN line is encountered.
- When the ER state is entered, add the accumulated record to the list of accumulated records.
At the end of this process, you will have a list of records, each holding the text accumulated under its tags. You may then process the tags in various ways.
Something like this:
from warnings import warn

Debug = True

def read_lines_from(file):
    """Read and split lines from file. This is a separate function, instead
    of just using file.readlines(), in case extra work is needed like
    dos-to-unix conversion inside a unix environment.
    """
    with open(file) as f:
        text = f.read()
    lines = text.split('\n')
    return lines

def parse_file(file):
    """Parse file in format given by
    https://stackoverflow.com/questions/54520331
    """
    lines = read_lines_from(file)
    state = 'ER'
    records = []
    current = None
    for line_no, line in enumerate(lines):
        tag, rest = line[:2], line[3:]
        if Debug:
            print(f"State: {state}, Tag: {tag}, Rest: {rest}")
        # Skip empty lines
        if tag == '':
            if Debug:
                print(f"Skip empty line at {line_no}")
            continue
        if tag == '  ':
            # Continuation line (two leading spaces).
            # Append text, except in ER state.
            if state != 'ER':
                if Debug:
                    print(f"Append text to {state}: {rest}")
                current[state].append(rest)
            continue
        # Found a tag. Process it.
        if tag == 'ER':
            if Debug:
                print("Tag 'ER'. Completed record:")
                print(current)
            records.append(current)
            current = None
            state = tag
            continue
        if tag == 'FN':
            if state != 'ER':
                warn(f"Found 'FN' tag without previous 'ER' at line {line_no}")
                if len(current.keys()):
                    warn(f"Previous record (FN:{current['FN']}) discarded.")
            if Debug:
                print("Tag 'FN'. Create empty record.")
            current = {}
        if current is None:
            # Tag outside an FN / ER pair: ignore it, per the design above.
            warn(f"Ignoring tag '{tag}' outside FN/ER pair at line {line_no}")
            continue
        # All tags except ER get this:
        if Debug:
            print(f"Tag '{tag}'. Create list with rest: {rest}")
        current[tag] = [rest]
        state = tag
    return records

if __name__ == '__main__':
    records = parse_file('input.txt')
    print('Records =', records)
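As one example of processing the tags afterwards, the sketch below joins the accumulated lines of a record into the kind of citation string the question asks for. The helper name `format_citation` and the hand-written `example` record are hypothetical; the record mimics the tag-to-list-of-lines shape that parse_file() returns, and the `https://doi.org/` prefix is taken from the question's code.

```python
def format_citation(record):
    """Build 'authors, year, title, DOI link' from one parsed record.

    `record` maps each two-letter tag to a list of text lines. Missing
    tags are treated as empty rather than raising KeyError.
    """
    authors = ", ".join(record.get("AU", []))
    title = " ".join(record.get("TI", []))
    year = " ".join(record.get("PY", []))
    doi = "".join(record.get("DI", []))
    return f"{authors}, {year}, {title}, https://doi.org/{doi}"


# Hand-written record in the same shape, so the example runs on its own.
example = {
    "AU": ["Chen, G", "Gully, SM", "Whiteman, JA", "Kilcullen, RN"],
    "TI": [
        "Examination of relationships among trait-like individual differences,",
        "state-like individual differences, and learning performance",
    ],
    "PY": ["2000"],
    "DI": ["10.1037//0021-9010.85.6.835"],
}
citation = format_citation(example)
print(citation)
```

Authors are joined with a comma and title lines with a space, since a title continuation is a wrapped sentence while each author line is a separate name.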