
How to read a CSV with pandas into a single column without a sep or delimiter

I have a txt file made up of many email/password combinations. The problem is that it is full of symbols at the start, middle, or end of the lines. These can all be replaced using regex, but my problem is reading the txt file and keeping all the data in 1 column. A delimiter or sep cannot be used, as each line contains so many different symbols. Even the default ',' is not viable, as some lines start with ',' so no data would be kept.

I already have a script which can find only the emails and remove noise using pandas and regex, but the initial read is my problem. I've heard of using the python engine over the c engine, but doing so causes some columns to show NaN and puts the rest of the email/password combo in column 2.

with open(self.breach_file, 'r', encoding='utf-8') as breach_file:
    # This read is the problem: the default ',' separator splits any line that contains a comma
    found_reader = pd.read_csv(breach_file, names=['Email'], dtype={'Email': str}, quoting=csv.QUOTE_NONE, engine='c')
    emails = found_reader['Email'].replace(symbol_dictionary_colon, ':', regex=True).replace(symbol_dictionary_no_space, '', regex=True)
    # '?' is a regex metacharacter, so remove it as a literal string
    emails = emails.str.replace('?', '', regex=False).str.strip()
    loaded_list = emails.str.replace(symbol_dictionary_first_char, '', regex=True)

I just want the data to be read in 1 column no matter what symbol the line starts with. Any help?

PS: I have tried using 2 columns and then, if column 1 is NaN, creating a new column with columns 1 and 2 joined, but this doesn't provide a feasible solution.

So does your file contain only one column of information, or is there other information besides the password? How big is your file?

If it is not big, you can do something like:

with open(self.breach_file, 'r', encoding='utf-8') as breach_file:
    # splitlines() drops the trailing newline that readlines() would keep on every entry
    passwords = breach_file.read().splitlines()

pd.DataFrame({'passwords': passwords})

If it is larger, you could read the file line by line and add the lines to your dataframe one at a time (but this might be slow). You could also try the read_fwf function, which expects fixed-width files and thus doesn't look for field separators. Apparently it does not require the lines in the file to all be the same length. It would look like:

pd.read_fwf('fake_fixed.txt', widths=[100], header=None, names=['Email'])  # header=None keeps the first line as data

You only have to make sure that you use a width that is at least as large as the longest line in the file.
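The line-by-line idea for larger files can be made less slow by reading a batch of lines at a time instead of appending single rows. The following is only a rough sketch under assumptions: the helper name read_lines_in_chunks, the chunk size, and the sample lines are made up for illustration, and an in-memory io.StringIO stands in for the real breach file.

```python
import io
from itertools import islice

import pandas as pd


def read_lines_in_chunks(handle, chunk_size=100_000):
    """Yield one single-column DataFrame per batch of lines (hypothetical helper)."""
    while True:
        # islice pulls at most chunk_size lines from the open file handle
        chunk = [line.rstrip('\r\n') for line in islice(handle, chunk_size)]
        if not chunk:
            break
        yield pd.DataFrame({'Email': chunk})


# Made-up sample data standing in for the real file
sample = io.StringIO(",a@b.com:pw\n?c@d.org;x\n:e@f.net\n")
frames = list(read_lines_in_chunks(sample, chunk_size=2))
```

Each frame can be cleaned with the existing regex steps on its own, so the whole file never has to sit in memory at once. With a real file, the handle would come from open(path, 'r', encoding='utf-8').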

Another possibility is to use

pd.read_csv('fake_fixed.txt', sep='\n')

This way you make sure the lines don't get split (assuming your lines are separated by newlines). You could even use a custom converter to parse out the email addresses (in case you really only need the info of one column), which might save some space.
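A custom converter along those lines might look like the sketch below. Several things here are assumptions rather than part of the original suggestion: the regex is only a rough approximation of an email address, the extract_email helper and the sample lines are made up, and the separator is '\x01' (a control character assumed never to occur in the data) because recent pandas releases reject '\n' as a separator.

```python
import csv
import io
import re

import pandas as pd

# Hypothetical pattern: a rough approximation of an email address, not a validator
EMAIL_PATTERN = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')


def extract_email(raw_line):
    """Return the first email-like substring in the line, or None if there is none."""
    match = EMAIL_PATTERN.search(raw_line)
    return match.group(0) if match else None


# Made-up sample lines, including one that starts with commas
sample = io.StringIO(",,john@example.com:pass123\n??alice@mail.org;secret\n")

emails = pd.read_csv(
    sample,
    sep='\x01',                  # assumed never to appear, so lines are not split
    names=['Email'],
    quoting=csv.QUOTE_NONE,      # treat quote characters as ordinary text
    converters={'Email': extract_email},
    engine='c',
)
```

Because the converter runs while the file is parsed, only the extracted address is stored for each line, not the full noisy string.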

Answer:

found_reader = pd.read_csv(breach_file, names=['Email'], dtype={'Email':str}, delimiter='\n', quoting=csv.QUOTE_NONE, engine='c')

Either delimiter or sep works.

Credit: https://stackoverflow.com/users/6925185/jottbe


 