Python: list of links from a TSV file

Question

I have a TSV file containing paths of links; within each path, the links are separated by ';'. I want to use:

In the example below we can see that the text in the file is split into columns, and I only want to read the last column, which is a path starting with '14th'.

6a3701d319fc3754    1297740409  166    14th_century;15th_century;16th_century;Pacific_Ocean;Atlantic_Ocean;Accra;Africa;Atlantic_slave_trade;African_slave_trade    NULL
3824310e536af032    1344753412  88     14th_century;Europe;Africa;Atlantic_slave_trade;African_slave_trade  3
415612e93584d30e    1349298640  138    14th_century;Niger;Nigeria;British_Empire;Slavery;Africa;Atlantic_slave_trade;African_slave_trade

I want to somehow split the path into a chain like this:

['14th_century', 'Niger', 'Nigeria'....]

How do I read the file and remove the first 3 columns so that I only have the last one?

UPDATE:

I have now tried this:

import re
with open('test.tsv') as f:
    lines = f.readlines()
for line in lines[22:len(lines)]:
    re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
    e_line = line.split(' ')
    real_line = e_line[0]
    print real_line.split(';')

But the problem is that it is not deleting the first 3 columns?

Answer 1

If the separator between the columns is only a single space, and not a series of spaces or a tab, you could do this:

with open('file_name') as f:
    lines = f.readlines()
for line in lines:
    e_line = line.split(' ')
    real_line = e_line[3]
    print real_line.split(';')
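
If the columns turn out to be separated by tabs or by runs of spaces, calling split() with no argument may be more forgiving, since it splits on any run of whitespace. A minimal sketch, assuming Python 3 and that the path is always the fourth whitespace-separated field; the function name path_from_line is made up for illustration:

```python
def path_from_line(line):
    """Return the 4th whitespace-separated field of a line, split on ';'."""
    fields = line.split()  # no argument: splits on any run of spaces or tabs
    return fields[3].split(';')


# Sample row in the same shape as the question's data (shortened path).
sample = "3824310e536af032    1344753412  88     14th_century;Europe;Africa  3\n"
print(path_from_line(sample))
```

This assumes the path itself contains no spaces, which holds for the sample rows in the question.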

Answer 2

Answer to your updated question.

But the problem is that it not deleting the first 3 columns ?

There are several mistakes.

Your code:

import re
with open('test.tsv') as f:
    lines = f.readlines()
for line in lines[22:len(lines)]:
    re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
    e_line = line.split(' ')
    real_line = e_line[0]
    print real_line.split(';')

This line does nothing:

re.sub(r"^\s+", " ", line, flags = re.MULTILINE)

That is because the re.sub function doesn't change your line variable; it returns the replaced string. So you may want to do the following:

line = re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
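
The point about the return value can be checked in isolation. The snippet below is only an illustrative sketch (Python 3 assumed, and the sample string is made up): Python strings are immutable, so re.sub can only hand back a new string.

```python
import re

line = "   abc"
re.sub(r"^\s+", " ", line)          # result is discarded; line is untouched
unchanged = line
line = re.sub(r"^\s+", " ", line)   # rebinding the name keeps the result
print(repr(unchanged), repr(line))
```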

Also, your regexp ^\s+ only matches whitespace (spaces or tabs) at the start of the string, because you use ^. But I think you just want to replace consecutive spaces or tabs with one space. So the code above becomes the following (just remove ^ from the regexp):

line = re.sub(r"\s+", " ", line, flags = re.MULTILINE)

Now the fields in line are separated by exactly one space, so line.split(' ') will work as you want.
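
To see the difference the ^ anchor makes, here is a small sketch, assuming Python 3 and a shortened row modelled on the question's data. The variable names anchored, collapsed and columns are just for illustration.

```python
import re

row = "415612e93584d30e    1349298640  138    14th_century;Niger;Nigeria\n"

# With ^ the pattern can only match at the start of the string; this row
# has no leading whitespace, so nothing is replaced.
anchored = re.sub(r"^\s+", " ", row)

# Without ^ every run of whitespace (including the trailing newline)
# becomes a single space.
collapsed = re.sub(r"\s+", " ", row)
columns = collapsed.split(' ')

print(anchored == row)
print(columns[3])
```

Note that the trailing newline is also turned into a space, so the list ends with an empty string; that does not affect the fourth column.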

Next, e_line[0] returns the first element of e_line, which is the 1st column of the line. But you want to skip the first 3 columns and get the 4th column. You can do it like this:

e_line = line.split(' ')
real_line = e_line[3]

OK. Now the entire code looks like this:

for line in lines:#<---I also changed here because there is no need to skip first 22 lines in your example.
    line = re.sub(r"\s+", " ", line)
    e_line = line.split(' ')
    real_line = e_line[3]
    print real_line

output:

14th_century;15th_century;16th_century;Pacific_Ocean;Atlantic_Ocean;Accra;Africa;Atlantic_slave_trade;African_slave_trade
14th_century;Europe;Africa;Atlantic_slave_trade;African_slave_trade
14th_century;Niger;Nigeria;British_Empire;Slavery;Africa;Atlantic_slave_trade;African_slave_trade

PS:

This line can be made more pythonic.

before:

for line in lines[22:len(lines)]:

after:

for line in lines[22:]:

Also, you don't need flags = re.MULTILINE, because line is a single line inside the for-loop.
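
As a rough illustration of what re.MULTILINE changes (Python 3 assumed; the sample strings are invented), the flag only matters when ^ has to match after a newline inside the string:

```python
import re

text = "  first\n  second\n"

# Without the flag, ^ matches only at the very start of the string.
without_flag = re.sub(r"^\s+", "", text)

# With the flag, ^ also matches right after each newline.
with_flag = re.sub(r"^\s+", "", text, flags=re.MULTILINE)

# For a string holding a single line, the flag makes no difference.
single = "  only one line\n"
same = re.sub(r"^\s+", "", single) == re.sub(r"^\s+", "", single, flags=re.MULTILINE)

print(repr(without_flag), repr(with_flag), same)
```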

Answer 3

You don't need to use regex for this. The csv module can handle tab-separated files too:

import csv

filereader = csv.reader(open('test.tsv', 'rb'), delimiter='\t')
path_list = [row[3].split(';') for row in filereader]

print(path_list)
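
The snippet above opens the file in 'rb' mode, which is the Python 2 convention for the csv module. On Python 3 the file would be opened in text mode with newline='' instead. A hedged sketch of that variant follows; the helper name read_paths and the in-memory sample are assumptions for illustration, and the sample only works if the real file is truly tab-separated:

```python
import csv
import io


def read_paths(tsv_file):
    """Yield the 4th column of each tab-separated row, split on ';'."""
    for row in csv.reader(tsv_file, delimiter='\t'):
        if len(row) > 3:          # skip blank or short rows
            yield row[3].split(';')


# With a real file on Python 3, this would look like:
#     with open('test.tsv', newline='') as f:
#         path_list = list(read_paths(f))
# An in-memory sample (shortened paths) keeps the sketch self-contained.
sample = io.StringIO(
    "3824310e536af032\t1344753412\t88\t14th_century;Europe;Africa\t3\n"
    "415612e93584d30e\t1349298640\t138\t14th_century;Niger;Nigeria\n"
)
path_list = list(read_paths(sample))
print(path_list)
```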
