Python Regex: How do I use a regular expression to read in a file with multiple lines, and extract words from each line to create two different lists?
country_names.txt is a file with multiple lines, each line containing a European country and an Asian country. Read in each line of text until there is a line with the country names.
Example line inside text file: <td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>
How do I use ONLY ONE regular expression to extract a European country and an Asian country from any line that contains two countries? After extracting the countries, store the European country in a list of European country names and the Asian country in a list of Asian country names.
When all the lines have been read in, print a count of how many European countries and Asian countries have been read in.
Currently, this is what I have:
import re
with open('country_names.txt') as infile:
    for line in infile:
        countries = re.findall("", "", infile) # regex code inside ""s in parenthesis
        european_countries = countries.group(1)
        asian_countries = countries.group(2)
To do it with a single regex, use:
^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>
You can play with it here: https://regex101.com/r/q9XHDD/1
When running it on your example you'll get:
>>> re.findall(r"^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>.*", "<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>")
[('England', 'Japan')]
My suggestion is not to use re.findall but re.match, which returns a match object whose groups you can read with .group(). Your code should then be:
import re

# Raw string, so the backslashes reach the regex engine unchanged
regex = r"^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>.*"
eu_countries = []
as_countries = []
with open('country_names.txt') as infile:
    for line in infile:
        match = re.match(regex, line)
        # match is None for lines that do not contain two countries
        if match:
            eu_countries.append(match.group(1))
            as_countries.append(match.group(2))
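The question also asks for a count of each list, which the snippet above stops short of. Here is a minimal sketch of the full loop, assuming an in-memory sample instead of the real file: the `sample` text (with placeholder numbers in its second row) and the `extract_countries` helper are hypothetical and not from the original post. It also illustrates why the asker's `countries.group(1)` fails: `re.findall` returns a plain list, whereas `re.match` returns a match object or `None`.

```python
import io
import re

# Same pattern as in the answer, written as a raw string so that \s and \/
# reach the regex engine unchanged.
PATTERN = re.compile(r"^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>")


def extract_countries(lines):
    """Return (european, asian) lists built from an iterable of text lines."""
    european, asian = [], []
    for line in lines:
        match = PATTERN.match(line)  # None when the line has no country pair
        if match:
            european.append(match.group(1))
            asian.append(match.group(2))
    return european, asian


# Hypothetical stand-in for country_names.txt. The first row is the example
# line from the question; the second row uses placeholder numbers.
sample = io.StringIO(
    "<table>\n"
    "<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>\n"
    "<td >France</td> <td>1.0</td> <td >India</td> <td>2.0</td></tr>\n"
    "</table>\n"
)

european, asian = extract_countries(sample)
print(len(european), "European countries:", european)
print(len(asian), "Asian countries:", asian)
```

To run it against the real file, pass the open file object instead of `sample`, for example `with open('country_names.txt') as infile: extract_countries(infile)`.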
You can use this regex to pull out the countries:
<\s*(td)[^>]*>(\w*)<\s*/\s*(td)>
It selects the <td> tags whose content is a single run of word characters. In the example line the numeric cells are skipped only because they contain a decimal point; \w also matches digits, so a cell holding an integer would match too.
For the example line, re.findall returns a list of tuples: [('td', 'England', 'td'), ('td', 'Japan', 'td')]
I then map over the list and select the 2nd element of each tuple, which is the country.
# `line` is the current line of the file inside the loop
regex = r'<\s*(td)[^>]*>(\w*)<\s*/\s*(td)>'
countries = list(map(lambda x: x[1], re.findall(regex, line)))
print(countries) # ['England', 'Japan']
One thing to note: inside the loop you need to pass line, not infile, to re.findall.
So to put it together:
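import re

regex = r'<\s*(td)[^>]*>(\w*)<\s*/\s*(td)>'
european_countries = []
asian_countries = []
with open('country_names.txt') as infile:
    for line in infile:
        countries = list(map(lambda x: x[1], re.findall(regex, line)))
        # Skip lines that do not contain both countries
        if len(countries) >= 2:
            european_countries.append(countries[0])
            asian_countries.append(countries[1])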
Please note this will not work if you have other <td> tags with text in them. The order of the countries also matters for this code. But it is a quick solution to your problem.
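The caveat above can be made concrete. Because `\w` also matches digits and underscores, a cell such as `<td>126</td>` would be picked up as a "country". Below is a sketch of a stricter variant, assuming country names consist of letters only. The `CELL` pattern and the test line are illustrative and not from the original answers. With a single capture group, `re.findall` returns plain strings, so no `map` step is needed.

```python
import re

# Pattern from the answer: \w* accepts letters, digits and underscores.
LOOSE = re.compile(r'<\s*(td)[^>]*>(\w*)<\s*/\s*(td)>')
# Stricter illustrative variant: one capture group, letters only.
CELL = re.compile(r'<td[^>]*>([A-Za-z]+)</td>')

# Hypothetical line whose numeric cells are integers (no decimal point).
line = '<td >England</td> <td>126</td> <td >Japan</td> <td>55</td></tr>'

loose = [m[1] for m in LOOSE.findall(line)]
strict = CELL.findall(line)

print(loose)   # the integer cells leak into the result
print(strict)  # only the alphabetic cells remain
```

Note that the letters-only assumption would reject multi-word names, so the character class would need widening if the file contains them.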
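# Note: this assumes each line has the form "European, Asian" (comma-separated),
# not the HTML row shown in the question.
e_countries = []
a_countries = []
with open('country_names.txt', 'r') as f:
    for line in f:
        parts = line.strip().split(', ')
        if len(parts) >= 2:
            e_countries.append(parts[0])
            a_countries.append(parts[1])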