
Extract strings from txt file into CSV

I'm trying to extract strings from a .txt file containing a few thousand sequences and write them to a CSV. I have deleted all of the irrelevant information from the original .txt file, and this is the format of the document I have now:

DEFINITION  Homo sapiens haplogroup HV5 mitochondrion, complete genome.
ACCESSION   DQ377992
/haplogroup="HV5"
/pop_variant="Ashkenazi Jew"
/note="ethnicity:Ashkenazi Jew; origin_locality:Belarus:Homel' Volast', Vyetka; origin_coordinates:52.51 N 31.17 E"
DEFINITION  Homo sapiens haplotype U5b1c mitochondrion, complete genome.
ACCESSION   DQ661681
/haplotype="U5b1c"
/note="Native American (Cherokee)"

I am trying to extract the accession numbers, haplotype or haplogroup, ethnicity, location (origin_locality), coordinates (origin_coordinates), and any additional information that might have been put in /note= to a CSV. One of the problems I am facing is that not every sequence has all of the information, and not all of the strings are in their own quotation marks.

How do I extract the accession numbers and the strings between quotation marks, and make sure that each extracted string ends up with the right sequence? Also, how would I deal with the strings that are only separated by semicolons?

Edit: The other question does not address missing information or the resulting alignment in a CSV, which was my primary concern.

You can create a class with all possible parameters as attributes. Then loop through all lines, creating a new object whenever required (i.e., when a line starts with 'DEFINITION') and filling in that object's attribute values as the following lines are read. After that, you can go through the objects and write each one's attribute values as a row in the CSV. Attributes that were never filled in stay empty, so the columns remain aligned even when a sequence is missing some information.
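Here is a minimal sketch of that approach in Python, assuming the file looks like the sample in the question. The `Record` class, its field names, and the helpers `parse_records`, `apply_qualifier`, and `write_csv` are hypothetical names chosen for illustration. It also assumes that `ethnicity`, `origin_locality`, and `origin_coordinates` are the only keyed entries inside `/note=`; anything else in the note is kept as free text.

```python
import csv
import io
import re
from dataclasses import dataclass, asdict, fields


@dataclass
class Record:
    # Every field defaults to "" so missing information becomes an empty cell.
    accession: str = ""
    haplo: str = ""               # haplogroup or haplotype, whichever is present
    pop_variant: str = ""
    ethnicity: str = ""
    origin_locality: str = ""
    origin_coordinates: str = ""
    note: str = ""                # free text from /note= without a known key


# Matches lines such as /haplogroup="HV5" and captures the name and the value.
QUALIFIER = re.compile(r'^/(\w+)="(.*)"$')
# Assumption: these are the only "key:value" entries that appear inside /note=.
NOTE_KEYS = {"ethnicity", "origin_locality", "origin_coordinates"}


def apply_qualifier(record, key, value):
    if key in ("haplogroup", "haplotype"):
        record.haplo = value
    elif key == "pop_variant":
        record.pop_variant = value
    elif key == "note":
        leftovers = []
        for part in value.split(";"):
            part = part.strip()
            # Split on the first colon only, so "Belarus:Homel' ..." stays whole.
            name, sep, rest = part.partition(":")
            if sep and name in NOTE_KEYS:
                setattr(record, name, rest.strip())
            elif part:
                leftovers.append(part)
        record.note = "; ".join(filter(None, [record.note, *leftovers]))


def parse_records(lines):
    records, current = [], None
    for raw in lines:
        line = raw.strip()
        if line.startswith("DEFINITION"):
            current = Record()        # a new sequence starts here
            records.append(current)
        elif current is None:
            continue                  # ignore anything before the first record
        elif line.startswith("ACCESSION"):
            parts = line.split()
            if len(parts) > 1:
                current.accession = parts[1]
        else:
            match = QUALIFIER.match(line)
            if match:
                apply_qualifier(current, match.group(1), match.group(2))
    return records


def write_csv(records, out):
    writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(Record)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


SAMPLE = """DEFINITION  Homo sapiens haplogroup HV5 mitochondrion, complete genome.
ACCESSION   DQ377992
/haplogroup="HV5"
/pop_variant="Ashkenazi Jew"
/note="ethnicity:Ashkenazi Jew; origin_locality:Belarus:Homel' Volast', Vyetka; origin_coordinates:52.51 N 31.17 E"
DEFINITION  Homo sapiens haplotype U5b1c mitochondrion, complete genome.
ACCESSION   DQ661681
/haplotype="U5b1c"
/note="Native American (Cherokee)"
"""

if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv(parse_records(SAMPLE.splitlines()), buffer)
    print(buffer.getvalue())
```

For a real file, you would pass an open file handle to `parse_records` and open the output with `open("out.csv", "w", newline="")`, as the `csv` module documentation recommends. Writing through `csv.DictWriter` also takes care of quoting values that contain commas, such as the locality in the first record.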
