How to read data from .log file with pandas

I have a log file with data from 100 pages, produced by a web-scraping script. The contents of the .log file look like this:

Title: Canon EF 100mm f/2.8L Macro IS USM
Price: 6�900 kr
Link: https://www.finn.no/bap/forsale/ad.html?finnkode=161065896
21-Oct-19 10:21:14 - Found:
Title: Canon EF 100mm f/2.8L Macro IS USM
Price: 7�500 kr
Link: https://www.finn.no/bap/forsale/ad.html?finnkode=155541389
21-Oct-19 10:21:14 - Found:
Title: Panasonic Lumix G 25mm F1.4 ASPH
Price: 3�200 kr
Link: https://www.finn.no/bap/forsale/ad.html?finnkode=161066674

I would like to import this data and export it to Excel in a layout like this:

title           price      link
canon 100mm     6900kr     https

This approach needs to be changed if the log file is not in the order you have shown, because the following function simply looks for the Title, Price and Link text on each line and appends the value to the matching list. To convert the lists to a DataFrame, all of them need to be the same length. Let me know if it works.

import pandas as pd


def log_to_frame(location="./datalake/file.log"):
    with open(location, mode='r', encoding='UTF-8') as f:
        title_list = []
        price_list = []
        link_list = []
        for line in f:
            # the value is whatever follows the first ": " on the line
            if "Title" in line:
                title = line.split(": ")[1].rstrip()
                title_list.append(title)
            elif "Price" in line:
                # drop the stray separator character inside the number
                price = line.split(": ")[1].replace("�", "").rstrip()
                price_list.append(price)
            elif "Link" in line:
                link = line.split(": ")[1].rstrip()
                link_list.append(link)
    # all three lists must be the same length here
    main_df = pd.DataFrame({"title": title_list, "price": price_list, "link": link_list})
    return main_df


log_df = log_to_frame()
log_df.to_excel("log.xlsx", index=False)
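Since the lists above have to stay the same length, one possible variation is to build one dict per record instead of three parallel lists. The following is only a sketch under assumptions: the names `FIELDS` and `parse_records` are made up for illustration, and it assumes every record begins with a "Title: " line. A record that lacks a Price or Link line then ends up with a missing value rather than shifting the later rows.

```python
import pandas as pd

FIELDS = ("Title", "Price", "Link")  # hypothetical name: keys expected in the log


def parse_records(lines):
    """Group Title/Price/Link lines into one row per record (illustrative helper)."""
    records = []
    current = None
    for line in lines:
        # split on the first ": " only, so the colon in "https://" is untouched
        key, sep, value = line.strip().partition(": ")
        if not sep or key not in FIELDS:
            continue  # timestamps, blank lines and anything else are skipped
        # assumption: a "Title" line marks the start of a new record
        if key == "Title" or current is None:
            current = {}
            records.append(current)
        current[key.lower()] = value
    # keys missing from a record become NaN in the resulting frame
    return pd.DataFrame(records, columns=[name.lower() for name in FIELDS])
```

It accepts any iterable of lines, so an open file handle can be passed directly, and the result can be written out with `to_excel` in the same way as above.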

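You can load the data into a DataFrame as a normal table and then combine the columns using the DataFrame's loc and reset_index functions. This assumes that there is only one ":" symbol on each line, separating the "key" column from the "value" column, and that every "record" has a line for every key. Note that in the sample log above, the Link lines (the "https:" part) and the timestamp lines contain additional colons, so such lines would need to be handled first for that file.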

import pandas as pd

p = pd.read_table("table.log", sep=':', header=None)
df = pd.DataFrame()
keys = set(p[0]) # set of all unique keys

for key in keys:
  # get all values with the current key and re-index them from 0...n
  col_data = p.loc[p[0]==key][1].reset_index(drop=True)
  # put this in a new column named after the key
  df[key] = col_data
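If some lines contain more than one colon (a URL value, for example), a variation on the same idea is to split each line at the first ": " only and then pivot the key/value pairs. This is a rough sketch, not a tested solution for the real file: the function name `pivot_key_value_lines` and its `keys` parameter are hypothetical, and it still assumes that every record has a line for every key.

```python
import pandas as pd


def pivot_key_value_lines(lines, keys=("Title", "Price", "Link")):
    """Turn 'Key: value' lines into one column per key (illustrative helper)."""
    s = pd.Series(list(lines)).str.strip()
    # partition at the first ": " so later colons stay inside the value
    parts = s.str.partition(": ")
    parts.columns = ["key", "sep", "value"]
    # keep only lines that really are one of the expected keys
    kv = parts[parts["key"].isin(list(keys)) & (parts["sep"] != "")]
    # n-th occurrence of each key is taken to belong to the n-th record
    kv = kv.assign(record=kv.groupby("key").cumcount())
    table = kv.pivot(index="record", columns="key", values="value")
    # fixed column order instead of the arbitrary order of a set
    return table.reindex(columns=list(keys))
```

Unlike iterating over `set(p[0])`, reindexing with an explicit key list gives the columns a predictable order.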
