
PySpark is not reading CSV properly

I am saving data from a Pandas DataFrame with 318477 rows to a CSV file using df.to_csv("preprocessed_data.csv"). When I load this file in another notebook:

df = pd.read_csv("preprocessed_data.csv")
len(df)

# out: 318477

the row count is as expected. However, when I try to load the dataset with PySpark:

spark_df = (spark.read.format("csv")
                 .option("header", "true")
                 .option("mode", "DROPMALFORMED")
                 .load("preprocessed_data.csv"))
spark_df.count()

# out: 6422020

or

df_test = spark.sql("SELECT * FROM csv.`preprocessed_data.csv`")
df_test.count()

# out: 6422020

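the row count is incorrect. Spark reports 6422020 rows, which is the number of physical lines in the CSV file, because some rows contain fields that span multiple lines (e.g. https://imgur.com/a/qWd9jtq).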
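The gap between the two counts can be reproduced on a tiny scale with Python's standard csv module. This is only an illustrative sketch: the three-record sample below is made up to resemble the real file (a quoted HTML field containing line breaks), and the column names are assumptions.

```python
import csv
import io

# Hypothetical sample: three records, the second one has a quoted field
# that contains embedded line breaks, like the HTML descriptions in the file.
sample = (
    "id,title,description\n"
    "120,teacher,\n"
    '121,fabricator,"<p>line one</p>\n<p>line two</p>\n<p>line three</p>"\n'
    "122,boilermaker,<p>single line</p>\n"
)

# Counting newline characters gives the number of physical lines.
physical_lines = sample.count("\n")

# A quote-aware parser keeps the embedded line breaks inside one record.
records = list(csv.reader(io.StringIO(sample, newline="")))

line_count = physical_lines - 1   # physical data lines, header excluded
record_count = len(records) - 1   # logical rows, header excluded
print(line_count, record_count)
```

A reader that splits on line breaks before parsing sees the larger number; a reader that honours quoted fields sees the smaller one.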

How can I solve this? Do I need to save the CSV so that no text field contains line breaks, or can I configure the CSV reader in PySpark more precisely?

This is a follow-up to my previous question; I now understand the problem better.

Rows from the CSV file:

120,teacher industrial design technology mabel park state high school,teach queensland,2018-10-07,brisbane,southern suburbs logan,education training,teaching secondary,mabel park state high school invites applications for a industrial design and technology teacher,,0,30,,0.0,0.03003003003003003
121,fabricatorinstaller,workplace access safety,2018-10-07,melbourne,bayside south eastern suburbs,trades services,welders boilermakers,trade qualified person with skills in welding and fabrication to assist in the manufacturing and installation of our custom height safety products,"<p>&nbsp;</p>
        <p><strong><em>*&nbsp; Secure long term role with genuine career path to supervisor</em></strong></p>
        <p><strong><em>*&nbsp; Competitive hourly rate with regular opportunity for overtime</em></strong></p>
        <p><strong><em>*&nbsp; Full on-the-job training</em></strong></p>
        <p><strong>About the&nbsp;role</strong></p>
        <p>Having recently won a significant new national contract we are looking for another trade qualified person with welding and fabrication skills to help manage increased demands on our production and installation departments.&nbsp; This role will
          see you involved in both manufacturing and on-site installation and there is a genuine career path to supervisor if that is your goal.&nbsp; Initially your role will require you to:-</p>
        <ul>
          <li>read and interpret drawings&nbsp;</li>
          <li>fabricate and assemble orders as required</li>
          <li>provide input to enhance factory processes</li>
          <li>pack&nbsp;and dispatch orders</li>
          <li>perform on-site installations (full training will be given)</li>
        </ul>
        <p><strong>About you</strong></p>
        <p>This role is ideal for a trade qualified person&nbsp;(welder, boilermaker, fabricator etc) with good hands-on skills who will enjoy&nbsp;dividing their time between&nbsp;factory/manufacturing and on-site installations.&nbsp; Because installations
          invariably take place on the roof, physical fitness is&nbsp;essential.</p>
        <p><strong>What we offer</strong></p>
        <ul>
          <li>A secure, long-term role with a successful, well-established organisation</li>
          <li>Full, ongoing on-the-job training</li>
          <li>Opportunity for career progression to supervisor for the right person</li>
          <li>Opportunity to work&nbsp;in a safe, supportive and friendly environment</li>
          <li>Competitive hourly rate with regular opportunities for overtime</li>
          <li>Occasional regional and interstate travel in response to major projects</li>
        </ul>
        <p><strong>How to apply</strong></p>
        <p>Please copy and paste the URL below into your browser (it is <em>not</em> a live link so&nbsp;must be copied and pasted).&nbsp; This will take you to our custom online application form which includes a number of screening questions&nbsp;and a
          profiling checklist which is an essential part of our application process.</p>
        <p><strong>https://exenet.expr3ss.com/jobDetails?selectJob=296&amp;</strong></p>
        <p>If you have any difficulties or would like more information please email <a class=""_2L3qcJ0"" data-contact-match=""true"" href=""mailto:gayle@exhr.com.au"">gayle@exhr.com.au</a> or phone <a class=""_2hhDNI-"" data-contact-match=""true"" href=""tel:0468 336 224"">0468 336 224</a>.</p>",0,30,full time,0.0,0.03003003003003003
122,boilermaker,rpm contracting qld pl,2018-10-07,brisbane,southern suburbs logan,trades services,welders boilermakers,perm rate 30 structural steel fab weld out located southside full time hours ongoing work ot modern clean facility offering great conditions,"<p>One of Australia's best engineering workshops is hiring!</p>
        <p>They have ongoing, rolling projects and need good people now.</p>
        <p>They are partnered with state and federal governments, international minerals and energy companies, and other market leading entities.</p>
        <p>The workshop is state of the art, clean, and well-managed. There is a genuine focus on the safety and wellbeing of their people.</p>
        <p>The facility and conditions are truly exceptional.</p>
        <p>Secure and long term positions are on offer for forward-thinking, cooperative and professional tradesmen.</p>
        <p>We are looking for qualified and/or ticketed boilermakers and 1st class welders that can offer high level trade skills.</p>
        <p>Equally important is a cooperative, team-orientated attitude and a willingness to become involved and take ownership of their important role in this company.</p>
        <p>They are building on a stable, permanent team, so candidates who step up can look forward to a secure future.</p>
        <p>The position is ongoing, offering full-time hours, exceptional conditions, and penalties.</p>
        <p>You require own car and licence, PPE and tools, relevant experience and to be available for an immediate start.</p>
        <p>Good luck and kind regards,</p>
        <p>RPM</p>",0,30,full time,0.0,0.03003003003003003



Using the sample provided, I tried the following code, which returns 3 rows:

>>> df = spark.read.csv('file:///tmp/test.csv', sep=',', multiLine=True)
>>> df.count()
3
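For the full file from the question, a few more reader options may be needed. The sketch below assumes the file was written by pandas with default settings, so it has a header line and embedded double quotes are escaped by doubling them (as in class=""_2L3qcJ0"" in the sample), whereas Spark's CSV reader uses a backslash as its default escape character. The helper name read_multiline_csv and the constant MULTILINE_CSV_OPTIONS are hypothetical; spark is expected to be an existing SparkSession.

```python
# Options for Spark's CSV reader. The values are assumptions about a file
# written by pandas.DataFrame.to_csv with default settings.
MULTILINE_CSV_OPTIONS = {
    "header": "true",     # first line holds the column names
    "multiLine": "true",  # allow quoted fields to span several lines
    "quote": '"',         # fields are wrapped in double quotes
    "escape": '"',        # an embedded quote is written as two quotes
}


def read_multiline_csv(spark, path):
    """Hypothetical helper: load `path` with the options above.

    `spark` is an existing SparkSession; the result is a Spark DataFrame.
    """
    return spark.read.options(**MULTILINE_CSV_OPTIONS).csv(path)
```

Calling read_multiline_csv(spark, "preprocessed_data.csv").count() should then be compared against the 318477 rows reported by pandas; whether the two match on the real data has not been verified here.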

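If it still does not work for you, I would try forcing pandas to use an explicit quote character and separator when writing the file.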
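One way to do that on the pandas side could look like the sketch below. The helper name write_quoted_csv is hypothetical, and quoting every field with csv.QUOTE_ALL is an assumption about what "forcing" should mean here; sep, quotechar and quoting are existing DataFrame.to_csv parameters.

```python
import csv


def write_quoted_csv(df, path):
    """Hypothetical helper: write a pandas DataFrame with explicit CSV settings.

    Every field is wrapped in double quotes, so a reader never has to guess
    whether a line break belongs to a field or ends a record.
    """
    df.to_csv(
        path,
        sep=",",                # explicit field separator
        quotechar='"',          # explicit quote character
        quoting=csv.QUOTE_ALL,  # quote every field, not only the ones that need it
    )
```

In the first notebook this would replace the plain df.to_csv("preprocessed_data.csv") call, for example write_quoted_csv(df, "preprocessed_data.csv"). The Spark reader still has to be told that quoted fields may span lines.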

This happens when PySpark is installed on a Windows machine and more than one PySpark instance is present on the system. Reinstalling PySpark should resolve the problem.

