
How to process an Apache log file with Hadoop using Python

I am very new to Hadoop and unable to understand the concepts well. I followed the process below:

  1. Installed Hadoop by following the guide here

  2. Tried the basic examples in the tutorial here and the wordcount example in Python, and they work fine.

What I am actually trying to do (the requirement I got) is to process the Apache log files located at /var/log/httpd on Fedora (Linux) with Hadoop, using Python, into the format below:

IP address    Count of IP   Pages accessed by IP address

I know that Apache log files are of two kinds:

  1. access_logs

  2. error_logs

but I am really unable to understand the format of Apache log files.

My Apache log file content looks like this:

::1 - - [29/Oct/2012:15:20:15 +0530] "GET /phpMyAdmin/ HTTP/1.1" 200 6961 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.77 Safari/537.1"
::1 - - [29/Oct/2012:15:20:16 +0530] "GET /phpMyAdmin/js/cross_framing_protection.js?ts=1336063073 HTTP/1.1" 200 331 "http://localhost/phpMyAdmin/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.77 Safari/537.1"
::1 - - [29/Oct/2012:15:20:16 +0530] "GET /phpMyAdmin/js/jquery/jquery-1.6.2.js?ts=1336063073 HTTP/1.1" 200 92285 "http://localhost/phpMyAdmin/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.77 Safari/537.1"

Can anyone please explain the structure of the Apache log lines above?

I am confused about how to process the log file into that data: IP address, count of IP address, pages accessed by IP address.

Can anyone let me know how we can process the Apache log files with Hadoop using Python and the above information, and store the result in the above-mentioned format?

Also, can anyone please provide basic Python code for processing the Apache log files into the above format, so that I get a practical idea of how to process the files with Python code and can extend it according to my needs?

This is just a partial answer, but I hope you will find it of use. If you need anything more specific, please update your question with your code and the specific points you are getting stuck on.

file processing stuff

The Python docs explain file processing really well.
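As a minimal sketch of reading a log file line by line (the function name `read_log_lines` is just illustrative; the path would be the one from your question):

```python
def read_log_lines(path):
    """Yield each line of a log file with the trailing newline stripped."""
    with open(path) as log_file:
        for line in log_file:
            yield line.rstrip('\n')

# e.g. for line in read_log_lines('/var/log/httpd/access_log'): ...
```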

If you want to monitor the log files in real time (I think that's what your question meant...) then check out this question here. It's also about monitoring a log file. I don't really like the accepted answer, but there are lots of nice suggestions.
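If you do want real-time monitoring, one common approach is a simple polling `tail -f`-style generator. A rough sketch (the `follow` name and polling interval are my choices, not from any particular answer):

```python
import time

def follow(path, poll_interval=0.1):
    """Yield lines as they are appended to the file at `path`, like `tail -f`."""
    with open(path) as log_file:
        log_file.seek(0, 2)  # start at the end of the file
        while True:
            line = log_file.readline()
            if not line:
                time.sleep(poll_interval)  # nothing new yet; wait and retry
                continue
            yield line
```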

line processing stuff

Once you manage to get individual lines from the log file, you'll want to process them. They are just strings, so as long as you know the format it's pretty simple. Again, I refer to the Python docs. In case you want to do anything intense, you might want to check that out.

Now, given the format of the line you gave us:

EDIT Given the actual format of the log lines, we can now make progress...

So suppose you grab a line from the log file such that:

line = '::1 - - [29/Oct/2012:15:20:15 +0530] "GET /phpMyAdmin/ HTTP/1.1" 200 6961 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.77 Safari/537.1"'

The first step is to split it up into pieces. I make use of the fact that the date and time are surrounded by '[...]':

lElements = line.split('[')
lElements = [lElements[0]] + lElements[1].split(']')

This leaves us with:

lElements[0] = '::1 - - ' #IPv6 localhost = ::1
lElements[1] = '29/Oct/2012:15:20:15 +0530'
lElements[2] = ' "GET /phpMyAdmin/ HTTP/1.1" 200 6961 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.77 Safari/537.1"'

The date element can be converted into a friendlier format.
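For example, `datetime.strptime` can parse it directly; in Python 3 the `%z` directive understands the numeric UTC offset (`parse_log_time` is just an illustrative name):

```python
from datetime import datetime

def parse_log_time(stamp):
    """Parse an Apache timestamp such as '29/Oct/2012:15:20:15 +0530'
    into a timezone-aware datetime object."""
    return datetime.strptime(stamp, '%d/%b/%Y:%H:%M:%S %z')
```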

The 'url' element contains stuff about the actual request: the HTTP verb, the URL, the HTTP version, the response status code (200), the response size in bytes (6961), the referer, and a bunch of user-agent stuff.

EDIT Adding code to grab the url and ip address, ignoring the time stuff:

ip_address = lElements[0].split('-')[0].strip() # the two '-' fields (identd and userid) are thrown away here
http_info = lElements[2].split('"')[1] # = 'GET /phpMyAdmin/ HTTP/1.1'
url = http_info.split()[1]  # = '/phpMyAdmin/'

"""
so now we have the ip address and the url. the next bit of code updates a dictionary dAccessCount as the number of url accesses increases...
dAccessCount should be set to {} initially
"""

if ip_address in dAccessCount:
    if url in dAccessCount[ip_address]:
        dAccessCount[ip_address][url]+=1
    else:
        dAccessCount[ip_address][url]=1
else:
    dAccessCount[ip_address] = {url:1}

So the keys of dAccessCount are all the IP addresses that have accessed any URL, the keys of dAccessCount[some_ip_address] are all the URLs that that IP address has accessed, and finally: dAccessCount[some_ip_address][some_url] = the number of times some_url was accessed from some_ip_address.
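To actually run this on Hadoop, the usual route for Python is Hadoop Streaming, where the per-line logic above splits into a mapper that emits (ip, url) pairs and a reducer that counts them. Here is a hedged local sketch of that split (the function names are mine; in a real Streaming job the mapper and reducer would be two separate scripts reading sys.stdin and writing tab-separated key/value lines):

```python
from collections import defaultdict

def mapper(lines):
    """Emit an (ip, url) pair for each access-log line."""
    for line in lines:
        try:
            ip = line.split()[0]
            # '"GET /phpMyAdmin/ HTTP/1.1"' -> '/phpMyAdmin/'
            url = line.split('"')[1].split()[1]
        except IndexError:
            continue  # skip malformed lines
        yield ip, url

def reducer(pairs):
    """Collapse (ip, url) pairs into {ip: {url: count}} (a dAccessCount-style dict)."""
    dAccessCount = defaultdict(lambda: defaultdict(int))
    for ip, url in pairs:
        dAccessCount[ip][url] += 1
    return dAccessCount
```

You can test this locally with `reducer(mapper(open('access_log')))` before wiring the two halves into a Streaming job.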
