Python Regex: How to use a regular expression to read a multi-line file and extract words from each line to create two different lists

Question

country_names.txt is a file with multiple lines, each line containing a European country and an Asian country. Read in each line of text until there is a line with the country names.

Example line inside the text file: <td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>

How do I use ONLY ONE regular expression to extract a European country and an Asian country from any line that contains two countries? After extracting the countries, store the European country in a list of European country names and store the Asian country in a list of Asian country names.

When all the lines have been read in, print a count of how many European countries and Asian countries have been read in.

Currently, this is what I have:

import re

with open('country_names.txt') as infile:
    for line in infile:
        countries = re.findall("", "", infile) # regex code inside ""s in parenthesis
        european_countries = countries.group(1)
        asian_countries = countries.group(2)

Answer 1

For one regex only, you should use ^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>. You can play with it here: https://regex101.com/r/q9XHDD/1

When running it on your example you'll get:

>>> re.findall(r"^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>.*", "<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>")
[('England', 'Japan')]
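
As a side note on the return type: re.findall returns a list of tuples when the pattern has more than one group, so the .group() calls from the question would not work on its result. A minimal sketch, assuming the sample line from the question (the names pattern, sample_line and pairs are only illustrative):

```python
import re

# Pattern from this answer, written as a raw string so the backslashes
# reach the regex engine unchanged.
pattern = r"^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>.*"
sample_line = "<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>"

pairs = re.findall(pattern, sample_line)  # list of (group 1, group 2) tuples

# A list has no .group() method; unpack the tuples instead.
for european, asian in pairs:
    print(european, asian)
```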

My suggestion is not to use re.findall but re.match, in which case your code should be:

import re

# Raw string so that \s and \/ are passed to the regex engine unchanged
regex = r"^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>.*"
eu_countries = []
as_countries = []
with open('country_names.txt') as infile:
    for line in infile:
        match = re.match(regex, line)
        if match:  # only lines that contain two country names
            eu_countries.append(match.group(1))
            as_countries.append(match.group(2))
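
The question also asks for a count once every line has been read. Below is a minimal sketch of the full loop, assuming an in-memory stand-in for country_names.txt. The sample_text contents (the header rows and the France/China row with its numbers) are made up for illustration and are not from the original file.

```python
import io
import re

regex = re.compile(r"^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>.*")

# Hypothetical file contents; for a real file, replace io.StringIO(sample_text)
# with open('country_names.txt').
sample_text = """<table>
<tr><th>Europe</th><th>Population</th><th>Asia</th><th>Population</th></tr>
<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>
<td >France</td> <td>67.06</td> <td >China</td> <td>1398</td></tr>
"""

eu_countries = []
as_countries = []
with io.StringIO(sample_text) as infile:
    for line in infile:
        match = regex.match(line)
        if match:  # lines without two country cells are skipped
            eu_countries.append(match.group(1))
            as_countries.append(match.group(2))

print(len(eu_countries), "European countries read")
print(len(as_countries), "Asian countries read")
```

Because the pattern starts with ^<td, rows with leading whitespace would be skipped. Using re.search and dropping the ^ is one way to relax that, if the real file is indented.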

Answer 2

You can use this regex to pull out the countries: <\s*(td)[^>]*>(\w*)<\s*/\s*(td)>. It selects the tags whose inner text is a single word. Decimal values such as 55.98 are skipped because the dot is not a word character (a whole number such as 126 would still match, since \w also matches digits).

This returns a list of tuples: [('td', 'England', 'td'), ('td', 'Japan', 'td')]

I then map over the list and select the second element of each tuple, which is the country.

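import re

line = '<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>'  # example line from the question
regex = r'<\s*(td)[^>]*>(\w*)<\s*/\s*(td)>'
countries = list(map(lambda x: x[1], re.findall(regex, line)))
print(countries)  # ['England', 'Japan']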
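
If the tag names are not needed in the result, one possible variation is to drop the capturing parentheses around td so that only the country name is captured. With exactly one group, re.findall returns a plain list of strings and the map step becomes unnecessary. A sketch, assuming the sample line from the question (this is a variation, not the regex proposed in this answer):

```python
import re

line = "<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>"

# Only the cell text is captured; \w+ (instead of \w*) also rules out empty cells.
regex = r"<\s*td[^>]*>(\w+)<\s*/\s*td>"
countries = re.findall(regex, line)
print(countries)  # ['England', 'Japan']
```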

One thing to note is that you need to pass line, not infile, to the regex call inside the loop.

So to put it together:

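import re

regex = r'<\s*(td)[^>]*>(\w*)<\s*/\s*(td)>'
european_countries = []
asian_countries = []

with open('country_names.txt') as infile:
    for line in infile:
        countries = list(map(lambda x: x[1], re.findall(regex, line)))
        if len(countries) == 2:  # skip lines that do not contain exactly two country cells
            european_countries.append(countries[0])
            asian_countries.append(countries[1])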

Please note that this will not work if the line has other <td> tags with text in them. The order of the countries also matters for this code. Still, it is a quick solution to your problem.
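
To make the caveat concrete, here is a small sketch with a hypothetical row that has an extra text cell (the Notes cell and the extract_pair helper are invented for illustration). The regex then returns three words, so positions 0 and 1 no longer hold the two countries; checking the number of matches is one simple guard.

```python
import re

regex = r'<\s*(td)[^>]*>(\w*)<\s*/\s*(td)>'

def extract_pair(line):
    """Return (european, asian) if the line has exactly two word-only cells, else None."""
    words = [m[1] for m in re.findall(regex, line)]
    return tuple(words) if len(words) == 2 else None

good = '<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>'
# Hypothetical row with an extra text cell in front
extra = '<td>Notes</td> <td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>'

print(extract_pair(good))   # the two countries
print(extract_pair(extra))  # None, because three word cells were found
```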

Answer 3

# Assumes each line is plain text of the form "England, Japan" (no HTML tags)
e_countries = []
a_countries = []
with open('country_names.txt', 'r') as f:
    for line in f:
        parts = line.strip().split(', ')  # strip() drops the trailing newline
        e_countries.append(parts[0])
        a_countries.append(parts[1])

Solution 1
3 2019-12-03 18:17:11

Solution 2
1 2019-12-03 18:07:09

Solution 3
0 2019-12-03 18:06:35
