
Extract e-mail addresses from .txt files in Python

I would like to parse out e-mail addresses from several text files in Python. As a first attempt, I tried to pick out the following element, which contains an e-mail address, from a list of strings ( '2To whom correspondence should be addressed. E-mail: joachim+pnas@uci.edu.\\n' ).

When I try to find the list element that contains the e-mail address via i.find("@") == 0, nothing is printed for that element. Am I misunderstanding the .find() function? Is there a better way to do this?

from os import listdir

TextFileList = []
PathInput = "C:/Users/p282705/Desktop/PythonProjects/ExtractingEmailList/text/"

# Count the number of different files you have!
for filename in listdir(PathInput):
    if filename.endswith(".txt"):  # In case you accidentally put other files in directory
        TextFileList.append(filename)

for i in TextFileList:
    file = open(PathInput + i, 'r')
    content = file.readlines()
    file.close()

for i in content:
    if i.find("@") == 0:
        print(i)

The standard way to check whether a string contains a character in Python is the in operator. In your case, that would be:

for i in content:
    if "@" in i:
        print(i)

The find method, which you were using, returns the position of the first @ character, counting from 0, as described in the official Python documentation.

For instance, for the string abc@google.com it returns 3. If the character is not found, it returns -1. The equivalent code would be:

for i in content:
    if i.find("@") != -1:
        print(i)

However, this is considered unpythonic, and the in operator is preferred.
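To make the difference concrete, here is a minimal sketch using the line from the question (the variable names are illustrative only, not part of the original code). It shows why the == 0 comparison fails: find returns the position of the first match, which is 0 only when the line starts with @.

```python
line = "2To whom correspondence should be addressed. E-mail: joachim+pnas@uci.edu.\n"

pos = line.find("@")           # index of the first "@", or -1 if there is none
starts_with_at = (pos == 0)    # True only if the line begins with "@"
contains_at = "@" in line      # True if "@" occurs anywhere in the line

short_pos = "abc@google.com".find("@")   # 3
missing_pos = "no at sign".find("@")     # -1

print(pos, starts_with_at, contains_at)
```

Here starts_with_at is False even though the line contains an address, which is why the original loop printed nothing.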

The find method in Python returns the index of that character in the string. Maybe you can try this?

words = i.split(' ')  # split the string into words (avoid naming this "list")
for x in words:       # check each word for the @ character
    if x.find("@") != -1:
        print(x)

find returns the index of the substring you are searching for, if it is present. That is not the right test for what you are trying to do.

You would be better off using a regular expression (the re module) to search for an occurrence of @. In your case, you may run into a situation where there is more than one e-mail address per line (again, I don't know your input data, so I can't guess).

Something along these lines should help:

import re
for i in content:
    findEmail = re.search(r'[\w\.-]+@[\w\.-]+', i)
    if findEmail:
        print(findEmail.group(0))

You would need to adjust this to match valid e-mail addresses. I'm not entirely sure whether symbols like + are allowed.
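As a hedged sketch of such an adjustment: the pattern below also accepts + in the part before the @, requires the domain to be dot-separated labels so that a trailing sentence period is left out, and uses re.findall to collect every address on a line. The names EMAIL_PATTERN and extract_emails are made up for this illustration, and the pattern is a pragmatic approximation rather than a full validator for every legal address.

```python
import re

# Assumed, simplified pattern: local part of word characters, ".", "+", "-";
# domain of one or more labels, each followed by at least one ".label".
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_emails(lines):
    """Return every address-like substring found in an iterable of lines."""
    found = []
    for line in lines:
        # findall returns all non-overlapping matches, so several
        # addresses on the same line are all collected.
        found.extend(EMAIL_PATTERN.findall(line))
    return found


content = [
    "2To whom correspondence should be addressed. E-mail: joachim+pnas@uci.edu.\n",
    "Contact a.b@example.org or c-d@mail.example.com.\n",
    "This line has no address.\n",
]

for address in extract_emails(content):
    print(address)
```

With the sample line from the question, the original pattern would stop at the + and keep the trailing period, whereas this one yields joachim+pnas@uci.edu.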


 