
Extracting email addresses and names from a text file

I will try to explain the problem as well as I can. I have a text file with email addresses and names. It looks like this: Barb Beney "de.mariof@vienna.aa", "Beny Beney" bet@catering.at, etc., all on the same line. This is just an example; I have thousands of such entries in one big text file. I want to extract the emails and names so that I end up with something like this:

Beny Beney bet@catering.at, that is, name and address separated, next to each other on one line, and without quote marks. In the end, all duplicate addresses should be removed from the file.

I wrote the code for extracting the email addresses and it works, but I don't know how to do the rest: how to extract the names, put them on the same line as the addresses, and eliminate duplicates. I hope I have described it well enough for you to see what I'm trying to do. This is the code I have:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Text.RegularExpressions;
using System.IO;

namespace Email
{
    class Program
    {
        static void Main(string[] args)
        {
            ExtractEmails(@"C:\Users\drake\Desktop\New.txt", @"C:\Users\drake\Desktop\Email.txt");
        }

        public static void ExtractEmails(string inFilePath, string outFilePath)
        {
            string data = File.ReadAllText(inFilePath);

            Regex emailRegex = new Regex(@"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*");
            MatchCollection emailMatches = emailRegex.Matches(data);

            StringBuilder sb = new StringBuilder();
            foreach (Match emailMatch in emailMatches)
            {
                // Write each matched address on its own line.
                sb.AppendLine(emailMatch.Value);
            }

            File.WriteAllText(outFilePath, sb.ToString());
        }
    }
}

Welcome! You can use this code. It reads the file produced by ExtractEmails and creates a new file that contains all the e-mails without duplicates:

    static void Main(string[] args)
    {
        TextWriter w = File.CreateText(@"C:\Users\drake\Desktop\NonDuplicateEmails.txt");
        ExtractEmails(@"C:\Users\drake\Desktop\New.txt", @"C:\Users\drake\Desktop\Email.txt");
        TextReader r = File.OpenText(@"C:\Users\drake\Desktop\Email.txt");
        RemovingAllDupes(r, w);
    }

    public static void RemovingAllDupes(TextReader reader, TextWriter writer)
    {
        string currentLine;
        HashSet<string> previousLines = new HashSet<string>();

        while ((currentLine = reader.ReadLine()) != null)
        {
            // Add returns true if it was actually added,
            // false if it was already there
            if (previousLines.Add(currentLine))
            {
                writer.WriteLine(currentLine);
            }
        }
        // Flush the output and release both files.
        writer.Close();
        reader.Close();
    }
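
If the intermediate file already has one address per line, the same de-duplication can probably be written more compactly with LINQ. This is only a sketch: the class and method names (LineDeduplicator, WriteDistinctLines) are made up for illustration, and it assumes that addresses differing only in letter case should count as duplicates, which the question does not state.

```csharp
using System;
using System.IO;
using System.Linq;

public static class LineDeduplicator
{
    // Copies the lines of inFilePath to outFilePath, skipping repeated lines.
    // Assumption: the two paths are different files, because the input is
    // still being read while the output is written.
    public static void WriteDistinctLines(string inFilePath, string outFilePath)
    {
        // File.ReadLines reads lazily, so the whole input is never held
        // in memory as one string; Distinct drops the repeated lines.
        File.WriteAllLines(
            outFilePath,
            File.ReadLines(inFilePath).Distinct(StringComparer.OrdinalIgnoreCase));
    }
}
```

Like the HashSet<string> version, this still keeps every distinct line in memory once, so it does not reduce memory use; it only shortens the code.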

For the new desired formatting, you could do something like this:

private string[] parseEmails(string bigStringIn)
{
    // Strip the quote marks, then split the entries on commas.
    string bigString = bigStringIn.Replace("\"", "");
    string[] output = bigString.Split(',');
    return output;
}

It takes the string with the mail addresses, removes the quote marks, then splits the string into a string array whose elements have the format: name lastname email@some.com (call .Trim() on each element to drop the surrounding spaces).

To delete the duplicate entries, a nested for loop should do the trick, comparing the strings (maybe after a .Split()) to find matches.
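
As a rough sketch of how the two steps could be combined, the code below strips the quotes, splits on commas, pulls the address out of each entry with the regex from the question, and treats whatever is left as the name. It uses a HashSet instead of the nested loop to skip addresses it has already seen. The names ContactParser and ParseContacts are hypothetical, and it assumes that entries are separated by commas, that names contain no commas, and that addresses should be compared without regard to case.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ContactParser
{
    // Same e-mail pattern as in the question.
    private static readonly Regex EmailRegex =
        new Regex(@"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*");

    // Returns "name email" lines, one per distinct address.
    public static List<string> ParseContacts(string data)
    {
        // Assumption: addresses are duplicates regardless of letter case.
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();

        // Assumption: entries are comma-separated and names contain no commas.
        foreach (string entry in data.Replace("\"", "").Split(','))
        {
            Match m = EmailRegex.Match(entry);
            if (!m.Success)
                continue; // no address in this entry, skip it

            // Whatever remains after cutting out the address is the name.
            string name = entry.Remove(m.Index, m.Length).Trim();

            // Add returns false when the address was already seen.
            if (seen.Add(m.Value))
                result.Add(name.Length > 0 ? name + " " + m.Value : m.Value);
        }
        return result;
    }
}
```

Something like File.WriteAllLines(outFilePath, ContactParser.ParseContacts(File.ReadAllText(inFilePath))) would then write the result. For the sample line in the question this gives "Barb Beney de.mariof@vienna.aa" and "Beny Beney bet@catering.at".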

You can also use this code with big files:

    static void Main(string[] args)
    {
        ExtractEmails(@"C:\Users\drake\Desktop\New.txt", @"C:\Users\drake\Desktop\Email.txt");
        var sr = new StreamReader(File.OpenRead(@"C:\Users\drake\Desktop\Email.txt"));
        var sw = new StreamWriter(File.OpenWrite(@"C:\Users\drake\Desktop\NonDuplicateEmails.txt"));
        RemovingAllDupes(sr, sw);
    }

    public static void RemovingAllDupes(StreamReader str, StreamWriter stw)
    {
        // Only hash codes are stored, which saves memory on big files.
        // Caveat: two different lines with the same hash code are treated as duplicates.
        var lines = new HashSet<int>();
        while (!str.EndOfStream)
        {
            string line = str.ReadLine();
            int hc = line.GetHashCode();
            if (lines.Contains(hc))
                continue; // already written
            lines.Add(hc);
            stw.WriteLine(line);
        }
        stw.Close();
        str.Close();
    }


