简体   繁体   中英

How to write csv file from scraped data from web in python

I am trying to scrape data from web pages and able to scrape also. After using below script getting all div class data but I am confused how to write data in CSV file like.

First Data in the first name column Last name data in last name column . .

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = 'http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTQmRGl2aXNpb249MSZVc2VySUQ9MjAyODcmUm9sZUlEPTEmQXBwSUQ9NSZBY3Rpb249U0VBUkNIJkNoYXJhY3RlckQ9MjImRXh0QXBwSUQ9'

page = urlopen(html)

data = BeautifulSoup(page, 'html.parser')

name_box = data.findAll('div', attrs={'class': 'col-md-3 col-sm-3'}) #edited companyName_99a4824b -> companyName__99a4824b

for i in range(len(name_box)):
    data = name_box[i].text.strip()

Data:

Information Type
Individual

First Name
KACHAM
Middle Name

Last Name
RAJESHWAR
Father Full Name
RAMAIAH
Do you have any Past Experience ?
No
Do you have any registration in other State than registred State?
No
House Number
8-2-293/82/A/446/1
Building Name
SAI KRUPA
Street  Name
ROAD NO 20
Locality
JUBILEE HILLS
Landmark
JUBILEE HILLS
State
Telangana
Division
Division 1
District
Hyderabad
Mandal
Shaikpet
Village/City/Town

Pin Code
500033
Office Number
04040151614
Fax Number

Website URL

Authority Name

Plan Approval Number
1/18B/06558/2018
Project Name
SKV S ANANDA VILAS
Project Status
New Project
Proposed Date of Completion
17/04/2024
Litigations related to the project ?
No
Project Type
Residential
Are there any Promoter(Land Owner/ Investor) (as defined by Telangana RERA Order) in the project ?
Yes
Sy.No/TS No.
00
Plot No./House No.
10-2-327
Total Area(In sqmts)
526.74
Area affected in Road widening/FTL of Tanks/Nala Widening(In sqmts)
58.51
Net Area(In sqmts)
1
Total Building Units (as per approved plan)
1
Proposed Building Units(as per agreement)
1


Boundaries East
PLOT NO 213
Boundaries West
PLOT NO 215
Boundaries North
PLOT NO 199
Boundaries South
ROAD NO 8
Approved Built up Area (In Sqmts)
1313.55
Mortgage Area  (In Sqmts)
144.28
State
Telangana
District
Hyderabad
Mandal
Maredpally
Village/City/Town

Street
ROAD NO 8
Locality
SECUNDERABAD COURT
Pin Code
500026

above is the data getting after run above code.

Edit

for i in range(len(name_box)):
    data = name_box[i].text.strip()
    print (data)
    fname = 'out.csv'
    with open(fname) as f:
        next(f)
        for line in f:
            head = []
            value = []
            for row in line:
                head.append(row)
            print (row)

Expected

Information Type | First  | Middle Name | Last Name | ......
Individual       | KACHAM |             | RAJESHWAR | .....

I have 200 url but all url data is not same means some of these missing. I want to write such way if data not avaialble then write anotthing just blank.

Please suggest. Thank you in advance

to write to csv you need to know what value should be in head and body, in this case head value should be html element contain <label

from urllib2 import urlopen
from bs4 import BeautifulSoup

html = 'http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTQmRGl2aXNpb249MSZVc2VySUQ9MjAyODcmUm9sZUlEPTEmQXBwSUQ9NSZBY3Rpb249U0VBUkNIJkNoYXJhY3RlckQ9MjImRXh0QXBwSUQ9'

page = urlopen(html)

data = BeautifulSoup(page, 'html.parser')

name_box = data.findAll('div', attrs={'class': 'col-md-3 col-sm-3'}) #edited companyName_99a4824b -> companyName__99a4824b

heads = []
values = []

for i in range(len(name_box)):
    data = name_box[i].text.strip()
    dataHTML = str(name_box[i])
    if 'PInfoType' in dataHTML:
        # <div class="col-md-3 col-sm-3" id="PInfoType">
        # empty value, maybe additional data for "Information Type"
        continue

    if 'for="2"' in dataHTML:
        # <label for="2">No</label>
        # it should be head but actually value
        values.append(data)

    elif '<label' in dataHTML:
        # <label for="PersonalInfoModel_InfoTypeValue">Information Type</label>
        # head or top row
        heads.append(data)

    else:
        # <div class="col-md-3 col-sm-3">Individual</div>
        # value for second row
        values.append(data)

csvData = ', '.join(heads) + '\n' + ', '.join(values)    
with open("results.csv", 'w') as f:
    f.write(csvData)

print "finish."

Question : How to write csv file from scraped data

Read the Data into a dict and use csv.DictWriter(... to write to CSV file.
Documentations about: csv.DictWriter while next break Mapping Types — dict

  1. Skip the first line, as it's the title
  2. Loop Data lines
    1. key = next(data)
    2. value = next(data)
    3. Break loop if no further data
    4. Build dict[key] = value
  3. After finishing the loop, write dict to CSV file

Output :

 {'Individual': '', 'Father Full Name': 'RAMAIAH', 'First Name': 'KACHAM', 'Middle Name': '', 'Last Name': 'RAJESHWAR',... (omitted for brevity)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM