csv writer output in one column

I have parsed some txt files and obtained the following list:

price = ['S-1', '20040319', '\t\t\t\tDIGIRAD CORP', '\t\t0000707388', 'price to be between $and $per ', 'S-1', '20040408', '\t\t\t\tBUCYRUS INTERNATIONAL INC', '\t\t0000740761', 'S-1', '20041027', '\t\t\t\tBUCYRUS INTERNATIONAL INC', '\t\t0000740761', 'S-1', '20050630', '\t\t\t\tSEALY CORP', '\t\t0000748015', 'S-1', '20140512', '\t\t\t\tCITIZENS FINANCIAL GROUP INC/RI', '\t\t0000759944', 'initial public offering and no public market exists for our shares. We anticipate that the initial public offering price will be between $and', 'S-1', '20110523', '\t\t\t\tCeres, Inc.', '\t\t0000767884', '    aggregate capital expenditures will be between $0.3&#160;million', 'S-1', '20171023', '\t\t\t\tBLUEGREEN VACATIONS CORP', '\t\t0000778946', '        <div style="margin-top:14pt; text-align:justify; line-height:12pt;">This is the initial public offering of Bluegreen Vacations Corporation. We are offering &#8194;&#8194; shares of our common stock and the selling shareholder identified in this prospectus is offering &#8194;&#8194; shares of our common stock. We will not receive any of the proceeds from the sale of shares by the selling shareholder. We anticipate that the initial public offering price of our common stock will be between $&#8199;&#8199; and $&#8199;&#8199; per ', 'S-1', '20020813', '\t\t\t\tVISTACARE INC', '\t\t0000787030']

My desired output is a CSV file in which each row starts with an 'S-1' document (each corresponding to a different company). So I built a second list that splits the one above into sublists, each starting at an 'S-1':

price2 = [s.strip('|').split('|') for s in re.split(r'(?=S-1)', '|'.join(price)) if s]
print(price2)
[['S-1', '20040319', '\t\t\t\tDIGIRAD CORP', '\t\t0000707388', 'price to be between $and $per '], ['S-1', '20040408', '\t\t\t\tBUCYRUS INTERNATIONAL INC', '\t\t0000740761'], ['S-1', '20041027', '\t\t\t\tBUCYRUS INTERNATIONAL INC', '\t\t0000740761'], ['S-1', '20050630', '\t\t\t\tSEALY CORP', '\t\t0000748015'], ['S-1', '20140512', '\t\t\t\tCITIZENS FINANCIAL GROUP INC/RI', '\t\t0000759944', 'initial public offering and no public market exists for our shares. We anticipate that the initial public offering price will be between $and'], ['S-1', '20110523', '\t\t\t\tCeres, Inc.', '\t\t0000767884', '    aggregate capital expenditures will be between $0.3&#160;million'], ['S-1', '20171023', '\t\t\t\tBLUEGREEN VACATIONS CORP', '\t\t0000778946', '        <div style="margin-top:14pt; text-align:justify; line-height:12pt;">This is the initial public offering of Bluegreen Vacations Corporation. We are offering &#8194;&#8194; shares of our common stock and the selling shareholder identified in this prospectus is offering &#8194;&#8194; shares of our common stock. We will not receive any of the proceeds from the sale of shares by the selling shareholder. We anticipate that the initial public offering price of our common stock will be between $&#8199;&#8199; and $&#8199;&#8199; per '], ['S-1', '20020813', '\t\t\t\tVISTACARE INC', '\t\t0000787030']]
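
As a side note, joining on '|' and then splitting on the text 'S-1' would misplace the row boundaries if any entry contained a '|' or mentioned 'S-1' inside a sentence. A minimal sketch of an alternative that starts a new row only when an entry is exactly 'S-1' might look like this; the helper name group_by_marker and the shortened sample list are illustrative assumptions, not part of the original code:

```python
def group_by_marker(items, marker="S-1"):
    """Split a flat list into sublists, opening a new one at each exact marker."""
    rows = []
    for item in items:
        # Open a new row at every marker (or at the very first item).
        if item == marker or not rows:
            rows.append([])
        rows[-1].append(item)
    return rows


# Shortened sample taken from the list in the question.
sample = [
    'S-1', '20040319', '\t\t\t\tDIGIRAD CORP', '\t\t0000707388',
    'price to be between $and $per ',
    'S-1', '20040408', '\t\t\t\tBUCYRUS INTERNATIONAL INC', '\t\t0000740761',
]
grouped = group_by_marker(sample)
print(grouped)
```

Because the comparison is on whole entries, text that merely contains 'S-1' stays inside its row.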

I then write this list to a CSV file:

with open('pricerange.csv', 'w') as out_file:
    wr = csv.writer(out_file)
    wr.writerow(["file_form", "filedate", "coname", "cik", "price_range"])  # Headlines in  top row
    wr.writerows(price2)

The output looks fine, with each sublist placed in its own row (i.e. each row starts with the 'S-1' element).

To clean the list further, I also want to remove the HTML entities (e.g. '&#8194;'). So I create a new list, price3:

price3 = re.sub('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});', '', str(price2)) #remove special characters or html tags in original .txt files
print(price3)
[['S-1', '20040319', '\t\t\t\tDIGIRAD CORP', '\t\t0000707388', 'price to be between $and $per '], ['S-1', '20040408', '\t\t\t\tBUCYRUS INTERNATIONAL INC', '\t\t0000740761'], ['S-1', '20041027', '\t\t\t\tBUCYRUS INTERNATIONAL INC', '\t\t0000740761'], ['S-1', '20050630', '\t\t\t\tSEALY CORP', '\t\t0000748015'], ['S-1', '20140512', '\t\t\t\tCITIZENS FINANCIAL GROUP INC/RI', '\t\t0000759944', 'initial public offering and no public market exists for our shares. We anticipate that the initial public offering price will be between $and'], ['S-1', '20110523', '\t\t\t\tCeres, Inc.', '\t\t0000767884', '    aggregate capital expenditures will be between $0.3million'], ['S-1', '20171023', '\t\t\t\tBLUEGREEN VACATIONS CORP', '\t\t0000778946', '        This is the initial public offering of Bluegreen Vacations Corporation. We are offering  shares of our common stock and the selling shareholder identified in this prospectus is offering  shares of our common stock. We will not receive any of the proceeds from the sale of shares by the selling shareholder. We anticipate that the initial public offering price of our common stock will be between $ and $ per '], ['S-1', '20020813', '\t\t\t\tVISTACARE INC', '\t\t0000787030']]

To my surprise, when I apply the same code to write price3 to a CSV file, all elements end up in the first column.

Any suggestions? I can't see where the bug is. Thank you so much.

There is no bug. In some regional settings Excel uses ';' rather than ',' as the default separator, so in your example it puts all the values in the first column. To view the CSV correctly, either change the separator in Excel's settings from ';' to ',' or save your CSV file with ';' as the delimiter, as follows:

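import csv

# newline='' is what the csv module documentation recommends for files passed to csv.writer
with open('pricerange.csv', 'w', newline='') as out_file:
    wr = csv.writer(out_file, delimiter=";")
    wr.writerow(["file_form", "filedate", "coname", "cik", "price_range"])  # header row
    wr.writerows(price2)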

There is no bug. The problem is that type(price2) is list while type(price3) is str. When writing to the file, the string is treated as a sequence of characters, so the code writes one character per row and produces the output in your screenshot:

list(price3)

['[',
 '[',
 "'",
 'S',
 '-',
 '1',
 "'",
 ',',
 ' ',
...
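
The same effect can be reproduced in isolation. The sketch below passes a short string to writerows() and collects the result in memory; the in-memory buffer and the shortened string are assumptions made to keep the example self-contained:

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)

# A str is iterable, so writerows() treats each character as its own row.
writer.writerows("[['S-1']]")

rows = buffer.getvalue().splitlines()
print(len(rows))  # one row per character of the string
print(rows)
```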

You must therefore convert the string price3 back into the corresponding list before writing the CSV file. One way to do this is:

import ast
price3_str = re.sub('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});', '', str(price2)) #remove special characters or html tags in original .txt files
price3 = ast.literal_eval(price3_str)
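
One caveat with this round trip through str() and ast.literal_eval(): the regex runs over the whole stringified list, so a match can span the boundary between two elements. The sketch below uses made-up sample data and only the tag part of the pattern to illustrate how two elements can be merged, compared with applying the substitution to each element:

```python
import ast
import re

pattern = r'<.*?>'
rows = [['a < b', 'c > d']]  # hypothetical data: '<' and '>' sit in different elements

# Round trip: the match runs from '<' in the first element to '>' in the second.
round_trip = ast.literal_eval(re.sub(pattern, '', str(rows)))

# Per-element substitution: neither element contains a complete '<...>' match.
per_item = [[re.sub(pattern, '', item) for item in row] for row in rows]

print(round_trip)
print(per_item)
```

For data like the list in the question, where tags open and close within a single element, the round trip gives the intended result.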

Now you can create the csv:

import csv

# newline='' is what the csv module documentation recommends for files passed to csv.writer
with open('pricerange3.csv', 'w', newline='') as out_file:
    wr = csv.writer(out_file, delimiter=";")
    wr.writerow(["file_form", "filedate", "coname", "cik", "price_range"])  # header row
    wr.writerows(price3)

The problem with price3 is that you converted price2 to a string in order to use re.sub(). writerows() expects a list of rows but receives a single string, so it treats the string as a sequence of characters and puts every character in a separate row.

Instead, use a list comprehension to apply re.sub() to each element of the list separately.

EDIT: As Massifox noted in a comment, the original version didn't work correctly; I have added an inner for loop and it now works.

price3 = [[re.sub('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});', '', item) for item in row] for row in price2]
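
Put together, the per-element cleaning and the CSV writing could be sketched as follows. The helper name clean_rows, the extra .strip() that drops the leading tabs and spaces, the two-row sample, and the in-memory buffer (used instead of a file so the result can be read back) are all assumptions for illustration:

```python
import csv
import io
import re

# Same pattern as in the question: HTML tags or character entities.
TAG_OR_ENTITY = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')


def clean_rows(rows):
    """Remove tags/entities from every item; also strip surrounding whitespace."""
    return [[TAG_OR_ENTITY.sub('', item).strip() for item in row] for row in rows]


# Two rows taken from price2 in the question.
price2_sample = [
    ['S-1', '20110523', '\t\t\t\tCeres, Inc.', '\t\t0000767884',
     '    aggregate capital expenditures will be between $0.3&#160;million'],
    ['S-1', '20020813', '\t\t\t\tVISTACARE INC', '\t\t0000787030'],
]
price3_sample = clean_rows(price2_sample)

buffer = io.StringIO()
wr = csv.writer(buffer)
wr.writerow(["file_form", "filedate", "coname", "cik", "price_range"])
wr.writerows(price3_sample)  # a list of lists: one CSV row per sublist

# Read the CSV text back to confirm each value lands in its own column.
read_back = list(csv.reader(io.StringIO(buffer.getvalue())))
print(read_back)
```

To write to disk instead, the buffer can be swapped for open('pricerange3.csv', 'w', newline='').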
