Extract e-mail addresses from .txt files in Python
I would like to parse out e-mail addresses from several text files in Python. In a first attempt, I tried to get the following element, which includes an e-mail address, from a list of strings:
'2To whom correspondence should be addressed. E-mail: joachim+pnas@uci.edu.\\n'
When I try to find the list element that includes the e-mail address via i.find("@") == 0, it does not give me the content[i]. Am I misunderstanding the .find() function? Is there a better way to do this?
from os import listdir

TextFileList = []
PathInput = "C:/Users/p282705/Desktop/PythonProjects/ExtractingEmailList/text/"

# Count the number of different files you have!
for filename in listdir(PathInput):
    if filename.endswith(".txt"):  # In case you accidentally put other files in directory
        TextFileList.append(filename)

for i in TextFileList:
    file = open(PathInput + i, 'r')
    content = file.readlines()
    file.close()

for i in content:
    if i.find("@") == 0:
        print(i)
The standard way of checking whether a string contains a character in Python is to use the in operator. In your case, that would be:
for i in content:
    if "@" in i:
        print(i)
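Applied to a whole folder, the same membership test might look like the sketch below. The function name lines_with_at, the folder parameter, and the UTF-8 encoding are assumptions made for illustration; they are not part of the original question.

```python
from pathlib import Path


def lines_with_at(folder):
    """Yield (file name, line) pairs for every line that contains "@"."""
    # Only *.txt files are read, mirroring the endswith(".txt") check.
    for path in sorted(Path(folder).glob("*.txt")):
        # The with-block closes the file even if reading fails.
        # UTF-8 is an assumption; adjust it to match the actual files.
        with open(path, "r", encoding="utf-8") as handle:
            for line in handle:
                if "@" in line:
                    yield path.name, line.rstrip("\n")
```

Calling list(lines_with_at(PathInput)) would then collect the matching lines from every .txt file in the folder, rather than from a single file's content list.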
The find method, as you were using it, returns the position of the @ character, counting from 0, as described in the official Python documentation. For instance, for the string abc@google.com it returns 3. If the character is not found, it returns -1. The equivalent code would be:
for i in content:
    if i.find("@") != -1:
        print(i)
However, this is considered unpythonic, and the in operator is preferred.
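A small illustration of why the == 0 comparison in the question rarely matches: find returns a position, and 0 only means the string starts with @. The sample strings below are made up for the demonstration, apart from abc@google.com.

```python
samples = [
    "abc@google.com",    # "@" in the middle of the string
    "@handle",           # "@" is the very first character
    "no address here",   # no "@" at all
]

# Position of "@" in each sample, or -1 when it is absent.
positions = [text.find("@") for text in samples]
# Plain membership test, which is what the question actually needs.
contains = ["@" in text for text in samples]

for text, position, found in zip(samples, positions, contains):
    print(repr(text), "find ->", position, "| in ->", found)
```

Only the second sample would pass a find("@") == 0 test, while the in operator reports the first two.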
The find method in Python returns the index of that character in a string. Maybe you can try this?
words = i.split(' ')  # split the line into words (avoid naming a variable "list")
for x in words:  # check each word for the @ character
    if x.find("@") != -1:
        print(x)
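One caveat with splitting on a single space is that punctuation and the trailing newline stay attached to the word. A possible variation, sketched here with a hypothetical helper named words_with_at and an assumed set of punctuation characters to trim, is:

```python
line = "2To whom correspondence should be addressed. E-mail: joachim+pnas@uci.edu.\n"


def words_with_at(text):
    """Return the words in text that contain "@", trimmed of edge punctuation."""
    found = []
    # split() with no argument splits on any whitespace, including "\n".
    for word in text.split():
        if "@" in word:
            # The characters stripped here are an assumption, not a complete list.
            found.append(word.strip(".,;:()<>"))
    return found


print(words_with_at(line))  # ['joachim+pnas@uci.edu']
```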
find returns the index of the substring you are searching for, if it is found. Comparing that index to 0 is not correct for what you are trying to do, because it only matches lines that start with @.
You would be better off using a regular expression (RE) to search for an occurrence of @. In your case, you may run into a situation where there is more than one e-mail address per line (again, I don't know your input data, so I can't take a guess). Something along these lines would benefit you:
import re

for i in content:
    findEmail = re.search(r'[\w\.-]+@[\w\.-]+', i)
    if findEmail:
        print(findEmail.group(0))
You would need to adjust this for valid e-mail addresses. For example, the pattern above does not allow symbols like +, which do appear in addresses such as joachim+pnas@uci.edu from the question.
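As a rough sketch of such an adjustment, the pattern below accepts + in the part before the @ and uses re.findall so that several addresses on one line are all returned. The pattern and the names EMAIL_PATTERN and extract_emails are assumptions for illustration; this is a loose match, not a full validator for every legal address.

```python
import re

# Assumed, deliberately loose pattern:
#   [\w.+-]+         local part: word characters, dots, plus signs, hyphens
#   @[\w-]+          first label of the domain
#   (?:\.[\w-]+)+    one or more ".label" parts, so a trailing "." is left out
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_emails(text):
    """Return every substring of text that looks like an e-mail address."""
    return EMAIL_PATTERN.findall(text)


print(extract_emails("E-mail: joachim+pnas@uci.edu.\n"))  # ['joachim+pnas@uci.edu']
```

Because findall returns a list, an empty result simply means the line had no match, so no separate check for None is needed.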