Write CSV file from scraped data with Beautiful Soup

This is how I scraped the data using Beautiful Soup:

import re

# `driver` is assumed to be an already-initialized Selenium WebDriver.
comments = []
users_list = []
users = driver.find_elements_by_class_name('_6lAjh')

for user in users:
    users_list.append(user.text)

texts_list = []
texts = driver.find_elements_by_class_name('C4VMK')

# Each text block starts with the username, so keep only what follows it.
for i, txt in enumerate(texts):
    texts_list.append(txt.text.split(users_list[i])[1].replace("\r", " ").replace("\n", " "))

comments_count = len(users_list)

# The range starts at 1, so the first entry is skipped.
for i in range(1, comments_count):
    user = users_list[i]
    text = texts_list[i]
    print("User ", user)
    print("Text ", text)
    print()
    comments.append(user)
    comments.append(text)
    # Print every @handle mentioned in the comment.
    idxs = [m.start() for m in re.finditer('@', text)]
    for idx in idxs:
        handle = text[idx:].split(" ")[0]
        print(handle)

This is the text data I have: usernames, comments, and like counts from Instagram. For example, in ' heyyy 3w1 likeReply', 'heyyy' is the comment, '3w' means the comment was written 3 weeks ago, and '1 like' is the number of likes.

print(comments)
['User1',
 ' 😱 3w1 likeReply',
 'User2',
 ' 💖 3w1 likeReply',
 'User3',
 ' Looking good! Collab, DM "bruteimpact.fashion 3wReply',
 'User4',
 ' heyyy 3w5 likeReply']

I want to save this to a CSV file that looks like this (three columns: ID, Comments, likes_count):

ID     Comments                                       likes_count
User1  😱                                             1
User2  💖                                             1
User3  Looking good! Collab, DM "bruteimpact.fashion  0
User4  heyyy                                          5

So far this is the code I have written, but it is far from the result I want and I do not know how to get there. I also have no idea how to build a separate 'likes_count' column by detaching the number of likes from the comment text. However, I would be satisfied with a CSV file with just "ID" and "Text" columns, without "likes_count". Please help me!

import csv

fields = ["User", "Text"]
rows = [comments]  # the whole flat list ends up as a single row
filename = "insta_records.csv"
with open(filename, 'w', encoding='utf-8', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(fields)
    csvwriter.writerows(rows)

You have a flat list, so you can use zip() to pair each user with its comment:

import csv

comments = ['User1',
 ' 😱 3w1 likeReply',
 'User2',
 ' 💖 3w1 likeReply',
 'User3',
 ' Looking good! Collab, DM "bruteimpact.fashion 3wReply',
 'User4',
 ' heyyy 3w5 likeReply']

# Even indexes hold users, odd indexes hold the matching comment text.
rows = []
for user, text in zip(comments[::2], comments[1::2]):
    print(user, text)
    rows.append([user, text])

fields = ["User", "Text"]
filename = "insta_records.csv"
# newline='' lets the csv module control line endings itself.
with open(filename, 'w', encoding='utf-8', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(fields)
    csvwriter.writerows(rows)

Result on screen

User1  😱 3w1 likeReply
User2  💖 3w1 likeReply
User3  Looking good! Collab, DM "bruteimpact.fashion 3wReply
User4  heyyy 3w5 likeReply

And in file

User,Text
User1, 😱 3w1 likeReply
User2, 💖 3w1 likeReply
User3," Looking good! Collab, DM ""bruteimpact.fashion 3wReply"
User4, heyyy 3w5 likeReply

To create other columns, you first have to process each comment with split(), replace(), slicing [start:end], etc.

rows = []
for user, text in zip(comments[::2], comments[1::2]):
    # Split off the last two space-separated tokens, e.g. '3w1' and 'likeReply'.
    parts = text.rsplit(' ', 2)
    parts.insert(0, user)
    print(parts)
    rows.append(parts)

Result on screen

['User1', ' 😱', '3w1', 'likeReply']
['User2', ' 💖', '3w1', 'likeReply']
['User3', ' Looking good! Collab, DM', '"bruteimpact.fashion', '3wReply']
['User4', ' heyyy', '3w5', 'likeReply']

However, '3wReply' has no space before 'Reply', so that row is not split correctly; handling it needs more work.
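One way to cope with the missing space, using only plain string methods, is to cut the known suffixes off the end before splitting. The sketch below assumes every text ends with 'Reply', optionally preceded by ' like' or ' likes'. Both `strip_suffix` and `split_comment` are hypothetical helper names, not part of any library.

```python
def strip_suffix(text, suffix):
    """Return text without the trailing suffix, if it is present."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text


def split_comment(text):
    """Split ' heyyy 3w5 likeReply' into ('heyyy', '3w5')."""
    # Assumption: the text always ends with 'Reply'.
    cleaned = strip_suffix(text.strip(), 'Reply')
    # Assumption: the like label is ' like' or ' likes'.
    for suffix in (' likes', ' like'):
        cleaned = strip_suffix(cleaned, suffix)
    # The last space-separated token is the age, with the like count glued on.
    comment, _, meta = cleaned.rpartition(' ')
    return comment, meta


comments = ['User1', ' 😱 3w1 likeReply',
            'User3', ' Looking good! Collab, DM "bruteimpact.fashion 3wReply']

rows = [[user, *split_comment(text)]
        for user, text in zip(comments[::2], comments[1::2])]
for row in rows:
    print(row)
```

With this approach '3wReply' and '3w1 likeReply' both reduce to a single trailing token ('3w' or '3w1'), which still has to be separated into age and like count.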

BTW: when you have '3w5' you can use split('w') to get ['3', '5'], but the HTML may contain other text instead of 'w', so this would also need more work. Perhaps more specific rules in BeautifulSoup could split it better.
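A regular expression may be a more tolerant way to separate the age from the like count, because it does not depend on the unit being 'w'. This is only a sketch: it assumes each text looks like "<comment> <number><unit>[<number> like(s)]Reply" with a lowercase unit. `parse_comment` and `write_records` are hypothetical names, and the pattern would need adjusting if Instagram's markup differs.

```python
import csv
import io
import re

# Assumed layout: "<comment> <number><unit>[<number> like(s)]Reply",
# where <unit> is one or more lowercase letters such as "w", "d" or "h".
PATTERN = re.compile(
    r"^\s*(?P<comment>.*?)\s+(?P<age>\d+[a-z]+)(?:(?P<likes>\d+) likes?)?Reply$",
    re.DOTALL,
)


def parse_comment(text):
    """Return (comment, age, likes); fall back to the raw text if nothing matches."""
    match = PATTERN.match(text)
    if match is None:
        return text.strip(), "", 0
    likes = int(match["likes"]) if match["likes"] else 0
    return match["comment"], match["age"], likes


def write_records(comments, csvfile):
    """Write ID, Comments, likes_count rows from the flat [user, text, ...] list."""
    writer = csv.writer(csvfile)
    writer.writerow(["ID", "Comments", "likes_count"])
    for user, text in zip(comments[::2], comments[1::2]):
        comment, _age, likes = parse_comment(text)
        writer.writerow([user, comment, likes])


comments = ['User1', ' 😱 3w1 likeReply',
            'User2', ' 💖 3w1 likeReply',
            'User3', ' Looking good! Collab, DM "bruteimpact.fashion 3wReply',
            'User4', ' heyyy 3w5 likeReply']

# An in-memory buffer keeps the demo free of side effects.
buffer = io.StringIO()
write_records(comments, buffer)
print(buffer.getvalue())
```

To produce a real file, pass `open("insta_records.csv", "w", encoding="utf-8", newline="")` to `write_records` instead of the buffer. A missing like count is written as 0.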
