
How to read a CSV with Pandas and only read it into 1 column without a Sep or Delimiter

I have a txt file made up of many email/password combinations. The problem is that the lines are full of stray symbols at the start, middle, or end. These can all be replaced using regex, but my problem is reading the txt file and keeping all the data in 1 column. A delimiter or sep cannot be used because each line contains so many different symbols. Even the default ',' is not viable: some lines start with ',', so the column would end up holding no data.

I already have a script that finds only the emails and removes the noise using pandas and regex, but the initial read is my problem. I've heard of using the python engine instead of the c engine, but doing so causes some rows to show NaN in the first column and puts the rest of the email/pass combo in column 2.

import csv
import pandas as pd

# The symbol_dictionary_* patterns are defined elsewhere in the script (not shown)
with open(self.breach_file, 'r', encoding='utf-8') as breach_file:
    # This is the read in question: the default ',' separator splits the lines
    found_reader = pd.read_csv(breach_file, names=['Email'], dtype={'Email': str}, quoting=csv.QUOTE_NONE, engine='c')

# The with block closes the file, so no explicit close() is needed
emails = found_reader['Email'].replace(symbol_dictionary_colon, ':', regex=True).replace(symbol_dictionary_no_space, '', regex=True)
# '?' is a regex metacharacter, so remove it as a literal string
emails = emails.str.replace('?', '', regex=False).str.strip()
loaded_list = emails.str.replace(symbol_dictionary_first_char, '', regex=True)

I just want the data to be read into 1 column no matter what symbol the line starts with. Any help?

PS: I have tried using 2 columns and then, if column 1 is NaN, creating a new column with columns 1 and 2 joined, but this doesn't provide a feasible solution.

So does your file contain only the info for one column, or is there other info besides the password? How big is your file?

If it is not big, you can do something like:

import pandas as pd

with open(self.breach_file, 'r', encoding='utf-8') as breach_file:
    # readlines() keeps the trailing newline of each line
    passwords = breach_file.readlines()

pd.DataFrame({'passwords': passwords})
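Since readlines() keeps the line terminators, here is a minimal sketch of the same idea that strips them and drops empty lines before building the frame. The helper name lines_to_frame, the column name and the sample data are made up for illustration; they are not from the question.

```python
import os
import tempfile

import pandas as pd


def lines_to_frame(path, column="Email"):
    """Read every non-empty line of a text file into a one-column DataFrame."""
    with open(path, "r", encoding="utf-8") as handle:
        # Strip only the line terminator so symbols inside the line survive
        rows = [line.rstrip("\r\n") for line in handle]
    return pd.DataFrame({column: [row for row in rows if row]})


# Hypothetical sample standing in for the real breach file
sample = ",alice@example.com:pass1\n\n;bob@example.org|pass,2\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)

frame = lines_to_frame(tmp.name)
os.remove(tmp.name)
print(frame)
```

Because no CSV parsing is involved, leading commas and other symbols cannot split a line; the whole file is held in memory, though, so this only suits files that fit comfortably in RAM.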

If it is larger, you could read it line by line and append each line to your dataframe (but this might be slow). You could also try the read_fwf function, which expects fixed-width files and thus doesn't look for field separators. Apparently it does not require the lines to be of the same length. It would look like:

pd.read_fwf('fake_fixed.txt', widths=[100], header=None)

You only have to make sure the width is at least as large as the longest line in the file.
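If you don't know the longest line up front, one option is to measure it first. This is only a sketch: the sample data and the column name Email are assumptions, and note that read_fwf strips leading and trailing whitespace from each field.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample standing in for the real breach file
sample = ",alice@example.com:pass1\n;bob@example.org|pass,2\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)

# First pass: find the longest line so the single column is wide enough
with open(tmp.name, "r", encoding="utf-8") as handle:
    width = max(len(line.rstrip("\r\n")) for line in handle)

# Second pass: one fixed-width column, no header row, everything kept as text
frame = pd.read_fwf(tmp.name, widths=[width], header=None, names=["Email"], dtype=str)
os.remove(tmp.name)
print(frame)
```

The extra pass over the file costs some time, but it avoids guessing a width that silently truncates long lines.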

Another possibility is to use

pd.read_csv('fake_fixed.txt', sep='\n')

This way you make sure the lines don't get split (assuming your lines are separated by newlines). You could even use a custom converter to parse out the email addresses (in case you really need only the info of one column), which might save some space.
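A rough sketch of the converter idea follows. The regex, the extract_email helper and the sample data are hypothetical, and the pattern is deliberately loose. As far as I know, newer pandas versions refuse '\n' as a separator, so the sketch assumes the ASCII unit separator '\x1f' never occurs in the file and uses that instead.

```python
import csv
import os
import re
import tempfile

import pandas as pd

# Hypothetical, loose pattern; real address rules are stricter
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def extract_email(raw_line):
    """Return the first email-like substring of a line, or '' if there is none."""
    match = EMAIL_PATTERN.search(raw_line)
    return match.group(0) if match else ""


# Hypothetical sample standing in for the real breach file
sample = ",;alice@example.com:pass1\n??bob.smith@mail.org|hunter,2\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)

frame = pd.read_csv(
    tmp.name,
    sep="\x1f",                 # assumption: this character never appears in the data
    header=None,
    names=["Email"],
    quoting=csv.QUOTE_NONE,     # treat quote characters as ordinary text
    converters={"Email": extract_email},  # runs on every raw line while parsing
)
os.remove(tmp.name)
print(frame)
```

The converter runs during parsing, so only the extracted address is stored and the raw line is never kept in the dataframe.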

Answer:

found_reader = pd.read_csv(breach_file, names=['Email'], dtype={'Email':str}, delimiter='\n', quoting=csv.QUOTE_NONE, engine='c')

Either delimiter or sep works.
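One caveat, to the best of my knowledge: recent pandas releases reject '\n' as a separator and raise a ValueError. A possible workaround, sketched below, keeps the same call but swaps in a separator that is assumed never to occur in the data ('\x1f' here; the sample data is made up). keep_default_na=False is an addition of mine so that lines such as "NA" stay as text.

```python
import csv
import os
import tempfile

import pandas as pd

# Hypothetical sample standing in for the real breach file
sample = ",alice@example.com:pass1\n;bob@example.org|pass,2\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)

frame = pd.read_csv(
    tmp.name,
    sep="\x1f",               # assumption: the unit separator never appears in the file
    header=None,
    names=["Email"],
    dtype={"Email": str},
    quoting=csv.QUOTE_NONE,   # keep quote characters as ordinary text
    keep_default_na=False,    # do not turn strings like "NA" into NaN
    engine="c",
)
os.remove(tmp.name)
print(frame)
```

Every line, including ones that start with ',', then lands whole in the Email column, and the regex cleanup from the question can run on it unchanged.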

Credit: https://stackoverflow.com/users/6925185/jottbe
