Python Regex: How do I use regular expression to read in a file with multiple lines, and extract words from each line to create two different lists

country_names.txt is a file with multiple lines, each line containing a European country and an Asian country. Read in each line of text until there is a line with the country names.

Example line inside text file: <td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>

How do I use ONLY ONE regular expression to extract a European country and an Asian country from any line that contains two countries? After extracting the countries, store the European country in a list of European country names and store the Asian country in a list of Asian country names.

When all the lines have been read in, print a count of how many European countries and Asian countries have been read in.

Currently, this is what I have:

import re

with open('country_names.txt') as infile:
    for line in infile:
        countries = re.findall("", "", infile) # regex code inside ""s in parenthesis
        european_countries = countries.group(1)
        asian_countries = countries.group(2)

For one regex only you should use ^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*> . You can play with it here: https://regex101.com/r/q9XHDD/1

When running it on your example you'll get:

>>> re.findall(r"^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>.*", "<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>")
[('England', 'Japan')]
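
To make the pieces of that pattern easier to follow, here is a sketch of the same regex written with re.VERBOSE so each part can carry a comment. The name `pattern` is just illustrative, and the backslash before `/` is dropped because Python does not need it.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
pattern = re.compile(r"""
    ^<td\s*>        # opening <td>, optional whitespace before the bracket
    ([a-zA-Z]+)     # group 1: first country (letters only)
    </td\s*>        # closing </td>
    .*              # anything in between (greedy)
    <td\s*>         # opening <td> of a later cell
    ([a-zA-Z]+)     # group 2: second country (letters only)
    </td\s*>        # closing </td>
""", re.VERBOSE)

line = "<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>"
print(pattern.findall(line))  # [('England', 'Japan')]
```

Because both groups only accept letters, the numeric cells are skipped. A country name containing a space or hyphen would not match either, so the character class may need widening for real data.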

My suggestion to you is not to use re.findall but to use re.match, and then your code should be:

import re

# Raw string so that \s and \/ reach the regex engine unchanged
regex = r"^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>.*"
eu_countries = []
as_countries = []
with open('country_names.txt') as infile:
    for line in infile:
        match = re.match(regex, line)
        if match:  # lines without two country cells are skipped
            eu_countries.append(match.group(1))
            as_countries.append(match.group(2))
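
The question also asks for a count once every line has been read. A minimal sketch of that last step, using an in-memory list in place of country_names.txt so it runs on its own; the helper name `split_countries` and the second sample row (including its placeholder numbers) are assumptions made for illustration:

```python
import re

# Same pattern as in the answer, compiled once.
PATTERN = re.compile(r"^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>")

def split_countries(lines):
    """Return (european, asian) lists from an iterable of text lines."""
    european, asian = [], []
    for line in lines:
        match = PATTERN.match(line)
        if match:  # lines without two country cells are skipped
            european.append(match.group(1))
            asian.append(match.group(2))
    return european, asian

# Hypothetical stand-in for the file contents; the numbers are placeholders.
sample = [
    "<table>\n",
    "<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>\n",
    "<td >France</td> <td>10.5</td> <td >India</td> <td>20.5</td></tr>\n",
]

european, asian = split_countries(sample)
print(len(european), "European countries,", len(asian), "Asian countries")
```

With the real file, passing the open file object (`split_countries(infile)`) should work the same way, since a file iterates line by line.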

You can use this regex to pull out the countries: <\s*(td)[^>]*>(\w*)<\s*/\s*(td)> This selects the tags whose text is a single word. Decimal values such as 55.98 are skipped because of the decimal point, but note that \w also matches digits, so a cell holding only an integer would match too.

This returns a list of tuples [('td', 'England', 'td'), ('td', 'Japan', 'td')]

I then map over the list and select the 2nd element of each tuple, which is the country.

regex = r'<\s*(td)[^>]*>(\w*)<\s*/\s*(td)>'  # raw string keeps the backslashes intact
countries = list(map(lambda x: x[1], re.findall(regex, line)))
print(countries)  # ['England', 'Japan']
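
As a possible simplification (not part of the original answer), the map step can be avoided by capturing only the cell text, since re.findall returns plain strings when the pattern has exactly one group. This sketch also swaps \w* for \w+ so empty cells are not returned; the name `cell_word` is made up for the example.

```python
import re

# Only the cell text is captured, so findall returns a list of strings.
cell_word = re.compile(r'<\s*td[^>]*>(\w+)<\s*/\s*td>')

line = '<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>'
countries = cell_word.findall(line)
print(countries)  # ['England', 'Japan']
```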

One thing to note is that you need to pass line, not infile, to the regex call inside the loop.

So to put it together:

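import re

regex = r'<\s*(td)[^>]*>(\w*)<\s*/\s*(td)>'
european_countries = []
asian_countries = []

with open('country_names.txt') as infile:
    for line in infile:
        # Keep only the middle element of each tuple, i.e. the cell text
        countries = list(map(lambda x: x[1], re.findall(regex, line)))
        if len(countries) == 2:  # skip lines that do not hold two countries
            european_countries.append(countries[0])
            asian_countries.append(countries[1])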

Please note this will not work if you have other <td> tags with text in them. The order of the countries also matters for this code, since the first match is treated as European and the second as Asian. But it is a quick solution to your problem.
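
To illustrate the caveat, here is a small sketch comparing the example line with a made-up header row. The header contents and the helper name `words_in_cells` are assumptions, not taken from the original file.

```python
import re

regex = r'<\s*(td)[^>]*>(\w*)<\s*/\s*(td)>'

def words_in_cells(line):
    """Return the text of every <td> cell that holds only word characters."""
    return [m[1] for m in re.findall(regex, line)]

data = '<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>'
# Hypothetical header row: every cell is a plain word, so all four match.
header = '<td>Europe</td> <td>Population</td> <td>Asia</td> <td>Population</td></tr>'

print(words_in_cells(data))    # ['England', 'Japan']
print(words_in_cells(header))  # ['Europe', 'Population', 'Asia', 'Population']
```

A row like the hypothetical header would be picked up as if it contained countries, so checking the number of matches per line (or using a stricter pattern) is worth considering.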

# Assumes each line looks like "England, Japan" (comma-separated, no HTML)
e_countries = []
a_countries = []
with open('country_names.txt', 'r') as f:
    for line in f:
        parts = line.strip().split(', ')  # strip() drops the trailing newline
        e_countries.append(parts[0])
        a_countries.append(parts[1])
