
How to write a CSV file from data scraped from the web in Python

I am trying to scrape data from web pages, and the scraping itself works. After running the script below I get all the div-class data, but I am confused about how to write the data to a CSV file so that, for example, the first-name data goes in the First Name column, the last-name data in the Last Name column, and so on.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = 'http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTQmRGl2aXNpb249MSZVc2VySUQ9MjAyODcmUm9sZUlEPTEmQXBwSUQ9NSZBY3Rpb249U0VBUkNIJkNoYXJhY3RlckQ9MjImRXh0QXBwSUQ9'

page = urlopen(html)

data = BeautifulSoup(page, 'html.parser')

name_box = data.findAll('div', attrs={'class': 'col-md-3 col-sm-3'})

for i in range(len(name_box)):
    data = name_box[i].text.strip()

Data:

Information Type
Individual

First Name
KACHAM
Middle Name

Last Name
RAJESHWAR
Father Full Name
RAMAIAH
Do you have any Past Experience ?
No
Do you have any registration in other State than registred State?
No
House Number
8-2-293/82/A/446/1
Building Name
SAI KRUPA
Street  Name
ROAD NO 20
Locality
JUBILEE HILLS
Landmark
JUBILEE HILLS
State
Telangana
Division
Division 1
District
Hyderabad
Mandal
Shaikpet
Village/City/Town

Pin Code
500033
Office Number
04040151614
Fax Number

Website URL

Authority Name

Plan Approval Number
1/18B/06558/2018
Project Name
SKV S ANANDA VILAS
Project Status
New Project
Proposed Date of Completion
17/04/2024
Litigations related to the project ?
No
Project Type
Residential
Are there any Promoter(Land Owner/ Investor) (as defined by Telangana RERA Order) in the project ?
Yes
Sy.No/TS No.
00
Plot No./House No.
10-2-327
Total Area(In sqmts)
526.74
Area affected in Road widening/FTL of Tanks/Nala Widening(In sqmts)
58.51
Net Area(In sqmts)
1
Total Building Units (as per approved plan)
1
Proposed Building Units(as per agreement)
1


Boundaries East
PLOT NO 213
Boundaries West
PLOT NO 215
Boundaries North
PLOT NO 199
Boundaries South
ROAD NO 8
Approved Built up Area (In Sqmts)
1313.55
Mortgage Area  (In Sqmts)
144.28
State
Telangana
District
Hyderabad
Mandal
Maredpally
Village/City/Town

Street
ROAD NO 8
Locality
SECUNDERABAD COURT
Pin Code
500026

The above is the data I get after running the code.

Edit

for i in range(len(name_box)):
    data = name_box[i].text.strip()
    print (data)
    fname = 'out.csv'
    with open(fname) as f:
        next(f)
        for line in f:
            head = []
            value = []
            for row in line:
                head.append(row)
            print (row)

Expected

Information Type | First  | Middle Name | Last Name | ......
Individual       | KACHAM |             | RAJESHWAR | .....

I have 200 URLs, but the data is not the same across all of them; some fields are missing. If a field's data is not available, I want to write nothing, just a blank.

Please suggest. Thank you in advance.

To write to CSV you need to know which value should go in the head and which in the body; in this case, a head value is an HTML element containing a <label> tag.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = 'http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTQmRGl2aXNpb249MSZVc2VySUQ9MjAyODcmUm9sZUlEPTEmQXBwSUQ9NSZBY3Rpb249U0VBUkNIJkNoYXJhY3RlckQ9MjImRXh0QXBwSUQ9'

page = urlopen(html)

data = BeautifulSoup(page, 'html.parser')

name_box = data.findAll('div', attrs={'class': 'col-md-3 col-sm-3'})

heads = []
values = []

for i in range(len(name_box)):
    data = name_box[i].text.strip()
    dataHTML = str(name_box[i])
    if 'PInfoType' in dataHTML:
        # <div class="col-md-3 col-sm-3" id="PInfoType">
        # empty value, maybe additional data for "Information Type"
        continue

    if 'for="2"' in dataHTML:
        # <label for="2">No</label>
        # it should be head but actually value
        values.append(data)

    elif '<label' in dataHTML:
        # <label for="PersonalInfoModel_InfoTypeValue">Information Type</label>
        # head or top row
        heads.append(data)

    else:
        # <div class="col-md-3 col-sm-3">Individual</div>
        # value for second row
        values.append(data)

csvData = ', '.join(heads) + '\n' + ', '.join(values)
with open("results.csv", 'w') as f:
    f.write(csvData)

print("finish.")
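Note that joining with ', ' breaks if any scraped value itself contains a comma; Python's csv module handles the quoting for you. A minimal sketch (the heads/values here are hypothetical stand-ins for the lists built in the loop above):

```python
import csv

# hypothetical example rows; in the script above these come from the scraped page
heads = ["First Name", "Last Name", "Locality"]
values = ["KACHAM", "RAJESHWAR", "JUBILEE HILLS"]

with open("results.csv", "w", newline="") as f:  # newline="" as the csv docs recommend
    writer = csv.writer(f)
    writer.writerow(heads)   # header row
    writer.writerow(values)  # data row
```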

Question: How to write a CSV file from scraped data

Read the data into a dict and use csv.DictWriter(... to write it to a CSV file.
Documentation on: csv.DictWriter, while, next, break, Mapping Types — dict

  1. Skip the first line, as it's the title
  2. Loop over the data lines:
    1. key = next(data)
    2. value = next(data)
    3. Break the loop if there is no further data
    4. Build dict[key] = value
  3. After finishing the loop, write the dict to the CSV file
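The steps above can be sketched like this, assuming `data` is an iterator over the stripped div texts (the short `lines` list here is a hypothetical stand-in for the scraped output):

```python
# hypothetical flat list of label/value texts, as produced by the scraping loop
lines = ["Information Type",       # title row, skipped
         "First Name", "KACHAM",
         "Middle Name", "",
         "Last Name", "RAJESHWAR"]

data = iter(lines)
next(data)                  # 1. skip the first line (the title)
record = {}
while True:                 # 2. loop over the remaining lines
    try:
        key = next(data)    # 2.1 the label text
        value = next(data)  # 2.2 the value that follows it
    except StopIteration:
        break               # 2.3 no further data
    record[key] = value     # 2.4 build dict[key] = value

print(record)
# → {'First Name': 'KACHAM', 'Middle Name': '', 'Last Name': 'RAJESHWAR'}
```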

Output:

 {'Individual': '', 'Father Full Name': 'RAMAIAH', 'First Name': 'KACHAM', 'Middle Name': '', 'Last Name': 'RAJESHWAR',... (omitted for brevity)
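With a dict like that in hand, csv.DictWriter can write the file; its restval='' argument covers the case of the 200 URLs where some fields are missing, writing a blank instead. A sketch (the second record and its values are hypothetical, only to show a missing field):

```python
import csv

# hypothetical records from two different URLs; the second is missing "District"
records = [
    {"First Name": "KACHAM", "Last Name": "RAJESHWAR", "District": "Hyderabad"},
    {"First Name": "OTHER", "Last Name": "NAME"},
]

fieldnames = ["First Name", "Last Name", "District"]
with open("out.csv", "w", newline="") as f:
    # restval="" writes a blank cell for any key missing from a record
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
```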
