
How to write a CSV file from data scraped from the web in Python

I am trying to scrape data from web pages, and the scraping itself works. After running the script below I get all the div-class data, but I am confused about how to write the data to a CSV file so that, for example, the first-name data goes in the First Name column, the last-name data in the Last Name column, and so on.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = 'http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTQmRGl2aXNpb249MSZVc2VySUQ9MjAyODcmUm9sZUlEPTEmQXBwSUQ9NSZBY3Rpb249U0VBUkNIJkNoYXJhY3RlckQ9MjImRXh0QXBwSUQ9'

page = urlopen(html)

data = BeautifulSoup(page, 'html.parser')

name_box = data.findAll('div', attrs={'class': 'col-md-3 col-sm-3'})

for i in range(len(name_box)):
    data = name_box[i].text.strip()

Data:

Information Type
Individual

First Name
KACHAM
Middle Name

Last Name
RAJESHWAR
Father Full Name
RAMAIAH
Do you have any Past Experience ?
No
Do you have any registration in other State than registred State?
No
House Number
8-2-293/82/A/446/1
Building Name
SAI KRUPA
Street  Name
ROAD NO 20
Locality
JUBILEE HILLS
Landmark
JUBILEE HILLS
State
Telangana
Division
Division 1
District
Hyderabad
Mandal
Shaikpet
Village/City/Town

Pin Code
500033
Office Number
04040151614
Fax Number

Website URL

Authority Name

Plan Approval Number
1/18B/06558/2018
Project Name
SKV S ANANDA VILAS
Project Status
New Project
Proposed Date of Completion
17/04/2024
Litigations related to the project ?
No
Project Type
Residential
Are there any Promoter(Land Owner/ Investor) (as defined by Telangana RERA Order) in the project ?
Yes
Sy.No/TS No.
00
Plot No./House No.
10-2-327
Total Area(In sqmts)
526.74
Area affected in Road widening/FTL of Tanks/Nala Widening(In sqmts)
58.51
Net Area(In sqmts)
1
Total Building Units (as per approved plan)
1
Proposed Building Units(as per agreement)
1


Boundaries East
PLOT NO 213
Boundaries West
PLOT NO 215
Boundaries North
PLOT NO 199
Boundaries South
ROAD NO 8
Approved Built up Area (In Sqmts)
1313.55
Mortgage Area  (In Sqmts)
144.28
State
Telangana
District
Hyderabad
Mandal
Maredpally
Village/City/Town

Street
ROAD NO 8
Locality
SECUNDERABAD COURT
Pin Code
500026

The above is the data I get after running the code.

Edit

for i in range(len(name_box)):
    data = name_box[i].text.strip()
    print (data)
    fname = 'out.csv'
    with open(fname) as f:
        next(f)
        for line in f:
            head = []
            value = []
            for row in line:
                head.append(row)
            print (row)

Expected

Information Type | First  | Middle Name | Last Name | ......
Individual       | KACHAM |             | RAJESHWAR | .....

I have 200 URLs, but the data is not the same across all of them; some fields are missing. If a field's data is not available, I want to write nothing, just a blank.

Please suggest. Thank you in advance.

To write to CSV you need to know which value should go in the head and which in the body; in this case, a head value is an HTML element containing a <label> tag.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = 'http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTQmRGl2aXNpb249MSZVc2VySUQ9MjAyODcmUm9sZUlEPTEmQXBwSUQ9NSZBY3Rpb249U0VBUkNIJkNoYXJhY3RlckQ9MjImRXh0QXBwSUQ9'

page = urlopen(html)

data = BeautifulSoup(page, 'html.parser')

name_box = data.findAll('div', attrs={'class': 'col-md-3 col-sm-3'})

heads = []
values = []

for i in range(len(name_box)):
    data = name_box[i].text.strip()
    dataHTML = str(name_box[i])
    if 'PInfoType' in dataHTML:
        # <div class="col-md-3 col-sm-3" id="PInfoType">
        # empty value, maybe additional data for "Information Type"
        continue

    if 'for="2"' in dataHTML:
        # <label for="2">No</label>
        # it should be head but actually value
        values.append(data)

    elif '<label' in dataHTML:
        # <label for="PersonalInfoModel_InfoTypeValue">Information Type</label>
        # head or top row
        heads.append(data)

    else:
        # <div class="col-md-3 col-sm-3">Individual</div>
        # value for second row
        values.append(data)

csvData = ', '.join(heads) + '\n' + ', '.join(values)
with open("results.csv", 'w') as f:
    f.write(csvData)

print("finish.")
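Note that joining with ', ' breaks if any scraped value itself contains a comma; Python's csv module handles the quoting for you. A minimal sketch (the heads/values here are hypothetical stand-ins for the lists built in the loop above):

```python
import csv

# hypothetical example rows; in the script above these come from the scraped page
heads = ["First Name", "Last Name", "Locality"]
values = ["KACHAM", "RAJESHWAR", "JUBILEE HILLS"]

with open("results.csv", "w", newline="") as f:  # newline="" as the csv docs recommend
    writer = csv.writer(f)
    writer.writerow(heads)   # header row
    writer.writerow(values)  # data row
```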

Question: How to write a CSV file from scraped data

Read the data into a dict and use csv.DictWriter(... to write it to a CSV file.
Documentation on: csv.DictWriter, while, next, break, Mapping Types — dict

  1. Skip the first line, as it's the title
  2. Loop over the data lines:
    1. key = next(data)
    2. value = next(data)
    3. Break the loop if there is no further data
    4. Build dict[key] = value
  3. After finishing the loop, write the dict to the CSV file
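The steps above can be sketched like this, assuming `data` is an iterator over the stripped div texts (the short `lines` list here is a hypothetical stand-in for the scraped output):

```python
# hypothetical flat list of label/value texts, as produced by the scraping loop
lines = ["Information Type",       # title row, skipped
         "First Name", "KACHAM",
         "Middle Name", "",
         "Last Name", "RAJESHWAR"]

data = iter(lines)
next(data)                  # 1. skip the first line (the title)
record = {}
while True:                 # 2. loop over the remaining lines
    try:
        key = next(data)    # 2.1 the label text
        value = next(data)  # 2.2 the value that follows it
    except StopIteration:
        break               # 2.3 no further data
    record[key] = value     # 2.4 build dict[key] = value

print(record)
# → {'First Name': 'KACHAM', 'Middle Name': '', 'Last Name': 'RAJESHWAR'}
```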

Output:

 {'Individual': '', 'Father Full Name': 'RAMAIAH', 'First Name': 'KACHAM', 'Middle Name': '', 'Last Name': 'RAJESHWAR',... (omitted for brevity)
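With a dict like that in hand, csv.DictWriter can write the file; its restval='' argument covers the case of the 200 URLs where some fields are missing, writing a blank instead. A sketch (the second record and its values are hypothetical, only to show a missing field):

```python
import csv

# hypothetical records from two different URLs; the second is missing "District"
records = [
    {"First Name": "KACHAM", "Last Name": "RAJESHWAR", "District": "Hyderabad"},
    {"First Name": "OTHER", "Last Name": "NAME"},
]

fieldnames = ["First Name", "Last Name", "District"]
with open("out.csv", "w", newline="") as f:
    # restval="" writes a blank cell for any key missing from a record
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
```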
