
CSV creating new lines when it shouldn't

I'm currently working on a personal project that includes scraping this specific website.

My code currently looks like this:

# Assumed imports, inferred from the aliases used below (not shown in the original post)
from urllib.request import urlopen as ureq
from bs4 import BeautifulSoup as soup

# f is the output CSV file, opened earlier (not shown in the original post)
for i in range(0, 4):
    my_url = 'https://www.kickante.com.br/campanhas-crowdfunding?page=' + str(i)
    uclient = ureq(my_url)
    page_html = uclient.read()
    uclient.close()

    page_soup = soup(page_html, 'html.parser')

    containers = page_soup.find_all("div", {"class": "campaign-card-wrapper views-row"})
    for container in containers:
        # Campaign title
        titleCampaignBruto = container.div.div.a.img["title"].replace('Crowdfunding para: ', '')
        titleCampaignParsed = titleCampaignBruto.strip().replace(",", ";")
        # Amount raised by the campaign
        arrecadadoFind = container.div.find_all("div", {"class": "funding-raised"})
        arrecadado = arrecadadoFind[0].text.strip().replace(",", ".")

        # Number of donors
        doadoresBruto = container.div.find_all('span', {"class": "contributors-value"})
        doadoresParsed = doadoresBruto[0].text.strip().replace(",", ";")

        # Campaign target
        fundingGoal = container.div.find_all('div', {"class": "funding-progress"})
        quantoArrecadado = fundingGoal[0].text.strip().replace(",", ";")

        # Campaign description
        descricaoBruta = container.div.find_all('div', {"class": "field field-name-field-short-description field-type-text-long field-label-hidden"})
        descricaoParsed = descricaoBruta[0].text.strip().replace(",", ";")

        # Campaign link
        linkCampanha = container.div.find_all('href')
        print("Título da campanha: " + titleCampaignParsed)
        print("Valor da campanha: " + arrecadado)
        print("Doadores: " + doadoresParsed)
        print("target: " + quantoArrecadado)
        print("descricao: " + descricaoParsed)

        f.write(titleCampaignParsed + "," + arrecadado + "," + doadoresParsed + "," + quantoArrecadado + "," + descricaoParsed.replace(",", ";") + "\n")
f.close()

When I open the CSV file it generates, I see that some lines are broken where they shouldn't be (for example, see line 31 of the CSV file). That line should be part of the previous line (line 30), as the body of the description.

Does anyone have an idea of what could be causing that? Thanks in advance.

Some of the text you're writing to the CSV file might contain newlines. You can remove them like so:

csv_line_entries = [
    titleCampaignParsed, arrecadado,  doadoresParsed, 
    quantoArrecadado, descricaoParsed.replace("," ,";")
]
csv_line = ','.join([
    entry.replace('\n', ' ') for entry in csv_line_entries
])
f.write(csv_line + '\n')
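As an alternative to stripping newlines and commas by hand, Python's standard csv module can quote fields that contain them, so each record stays a single logical row. The sketch below is only an illustration: the helper name write_rows and the sample row are hypothetical and not taken from the question, and it writes to an in-memory buffer rather than a real file.

```python
import csv
import io


def write_rows(rows, out):
    """Write rows to a file-like object and let csv.writer handle quoting."""
    writer = csv.writer(out)
    for row in rows:
        writer.writerow(row)


# Hypothetical sample row: the last field contains a newline and a comma.
rows = [["Campanha A", "R$ 1.000,00", "12", "50%",
         "First line\nsecond line, with a comma"]]

# newline="" stops the buffer from translating line endings,
# mirroring open(path, "w", newline="") for a real file.
buffer = io.StringIO(newline="")
write_rows(rows, buffer)

# Reading the data back yields one record with the original field values.
parsed = list(csv.reader(io.StringIO(buffer.getvalue(), newline="")))
print(len(parsed), parsed[0][4] == rows[0][4])
```

With this approach the embedded newline is preserved inside a quoted field instead of being removed. A CSV-aware reader treats it as part of the same record, although the raw file still spans two physical lines when viewed in a plain text editor.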

Cause of the bug

The strip() method removes only leading and trailing whitespace, including newlines.

import bs4
# Pass the parser explicitly to avoid the "no parser specified" warning
soup = bs4.BeautifulSoup('<p>Whatever\nelse\n</p>', 'html.parser')
soup.find('p').text.strip()
# Result: 'Whatever\nelse'

Notice that the inner \n is not removed.

You have newlines in the middle of the text. strip() only removes whitespace at the start and end of a string, so you need to use replace('\n', '') instead. This replaces every newline \n with the empty string ''. Use replace('\n', ' ') if the words on either side of the newline should stay separated.
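One caveat is that replace('\n', '') joins the words on either side of the newline and leaves any carriage returns (\r) in place. A minimal sketch of a more general cleanup, assuming it is acceptable to collapse all runs of whitespace into single spaces, is shown below. The helper name clean_field and the sample string are hypothetical.

```python
def clean_field(text):
    """Collapse every run of whitespace (spaces, tabs, \\r, \\n) into one space."""
    # str.split() with no arguments splits on any whitespace and drops
    # empty pieces, so leading and trailing whitespace disappears too.
    return " ".join(text.split())


# Hypothetical scraped description with inner newlines and extra spaces.
raw = "  Ajude a\nnossa   campanha\r\n hoje \n"
cleaned = clean_field(raw)
print(repr(cleaned))
```

Applied to each scraped value in place of strip(), this keeps every field on a single line before it is written.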
