Analyzing Logs with Hive

Study notes

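Hive is a data warehouse tool built on Hadoop. It maps structured data files onto database tables and provides SQL-like querying (HiveQL), translating the statements into MapReduce jobs for execution.
Its main advantage is a low learning curve: simple MapReduce statistics can be produced quickly with SQL-like statements, with no need to write a dedicated MapReduce application, which makes it well suited to statistical analysis in a data warehouse.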

Hadoop installation
http://www.lishiming.net/thread-5637-1-1.html

Hive official documentation
https://cwiki.apache.org/confluence/display/Hive/Home

I. Installing Hive
Download a Hive release that supports your Hadoop version: http://hive.apache.org/releases.html

Unpack the archive in the chosen directory:
tar zxvf hive-0.11.0.tar.gz

Set the environment variables:

vi /etc/profile
export HIVE_INSTALL=/usr/local/hive-0.11.0
export PATH=$PATH:$HIVE_INSTALL/bin

1. Interactive mode

hive> show tables;
OK
Time taken: 6.896 seconds

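The first run is slow because Hive is creating the metastore database on the local machine. The database is placed in a metastore_db directory under the directory where the hive command was run.

2. Non-interactive mode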

hive -e 'show tables'
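To run several statements, put them in a script file such as script.q:
hive -f script.q
hive -S runs in silent mode: informational messages such as the OK and Time taken lines are suppressed, and only query results are printed.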
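
As a rough sketch of how these non-interactive options might be driven from a script, the Python helper below assembles the argument list for hive -e, hive -f and -S. The name build_hive_command is made up for this note, and the helper only builds the command; actually running it (for example with subprocess.run) assumes hive is on the PATH as set up in /etc/profile above.

```python
from typing import List, Optional


def build_hive_command(query: Optional[str] = None,
                       script: Optional[str] = None,
                       silent: bool = False) -> List[str]:
    """Build an argv list for a non-interactive hive call (hypothetical helper)."""
    # Exactly one of query / script must be given.
    if (query is None) == (script is None):
        raise ValueError("pass exactly one of query or script")
    cmd = ["hive"]
    if silent:
        cmd.append("-S")        # silent mode
    if query is not None:
        cmd += ["-e", query]    # inline statement(s)
    else:
        cmd += ["-f", script]   # statements read from a file such as script.q
    return cmd


print(build_hive_command(query="show tables"))
print(build_hive_command(script="script.q", silent=True))
```

Passing the list to subprocess.run avoids shell quoting problems when the query itself contains quotes.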

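II. Configuring Hive
Hive's configuration files live in $HIVE_INSTALL/conf and, as with Hadoop, are XML files. The --config option points Hive at a different configuration directory,
for example: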

hive --config /usrs/home/hive-conf

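Alternatively, set the HIVE_CONF_DIR environment variable to the configuration directory.
1. The main configuration file is hive-site.xml. The entries that usually need changing are the storage directories; if MySQL is used for the metastore, the corresponding entries must be changed as well.
Settings that tie Hive to Hadoop are also changed here.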

cp hive-default.xml.template hive-site.xml

2. Set the environment variables in hive-env.sh, and change the warehouse storage path from the default /user/hive/warehouse to /usr/local/hive-0.11.0/warehouse

[root@zabbix conf]# grep -v "^\#" hive-env.sh
HADOOP_HOME=/usr/local/hadoop
export HIVE_CONF_DIR=/usr/local/hive-0.11.0/conf

Edit the hive.metastore.warehouse.dir entry in hive-site.xml:

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/usr/local/hive-0.11.0/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
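
To confirm that the warehouse path was actually changed, the value can be read back from the XML. The sketch below is only an illustration: read_property and SAMPLE_HIVE_SITE are names invented here, and the sample text stands in for the real hive-site.xml, which is assumed to use the usual configuration/property/name/value layout.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Stand-in for the contents of hive-site.xml (assumed layout).
SAMPLE_HIVE_SITE = """<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/usr/local/hive-0.11.0/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>
</configuration>"""


def read_property(config_xml: str, name: str) -> Optional[str]:
    """Return the <value> of the named <property>, or None if it is absent."""
    root = ET.fromstring(config_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


print(read_property(SAMPLE_HIVE_SITE, "hive.metastore.warehouse.dir"))
```

For a real check, the file contents would be read from $HIVE_INSTALL/conf/hive-site.xml and passed in place of the sample string.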

III. Using Hive
To load logs of a given format into Hive, first create a table that matches the log format.
The log format is:

1.1.1.1 m.c.com - - [08/Oct/2013:00:02:08 +0800] "GET /thread-559-1-1.html HTTP/1.1" 200 9006 "-" "Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.1)"
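
As a quick sanity check of this format, the pattern passed as input.regex in this section's CREATE TABLE statement can be tried on the sample line outside Hive. The sketch below is Python and not part of the original setup. It assumes Python's re module treats this particular pattern the same way as the Java regex engine used by the SerDe, and that the whole line has to match. parse_line and COLUMNS are names made up for the illustration.

```python
import re

# input.regex from the table definition with the HiveQL string escaping
# removed (\\[ becomes \[ and \" becomes ").
INPUT_REGEX = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ("[^ ]*) ([^ ]*) ([^ ]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)

# Column names in the order used by the table definition.
COLUMNS = ["host", "web", "identity", "user", "time", "method", "request",
           "protocol", "status", "size", "referer", "agent"]

SAMPLE = ('1.1.1.1 m.c.com - - [08/Oct/2013:00:02:08 +0800] '
          '"GET /thread-559-1-1.html HTTP/1.1" 200 9006 "-" '
          '"Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.1)"')


def parse_line(line):
    """Return a column-name -> value dict, or None if the line does not match."""
    match = INPUT_REGEX.fullmatch(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


row = parse_line(SAMPLE)
print(row["request"], row["status"], row["method"])
```

With this pattern the method field keeps its leading double quote and the protocol field keeps its trailing one, so the script prints /thread-559-1-1.html 200 "GET.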

Create the table:

CREATE TABLE test (
  host STRING, web STRING, identity STRING, user STRING, time STRING, method STRING,
  request STRING, protocol STRING, status STRING, size STRING, referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) (\"[^ ]*) ([^ ]*) ([^ ]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s")
STORED AS TEXTFILE;

Data can be loaded either from the local file system or from a log file in HDFS.
1. Loading local data

LOAD DATA LOCAL INPATH '/usr/local/log/m.m.com_20131008.gz' OVERWRITE INTO TABLE test;

2. Loading HDFS data
First copy the local log file into HDFS:

hadoop fs -put /usr/local/log/m.m.com_20131008.gz /usr/local/log/

Then load the log from HDFS:

hive> LOAD DATA INPATH '/usr/local/log/m.m.com_20131008.gz' OVERWRITE INTO TABLE hdfstest;

Check that the table structure is correct:

hive> describe test;

If select count(*) fails with an error, add the hive-contrib JAR, which contains the RegexSerDe class:

hive> add jar /usr/local/hive-0.11.0/lib/hive-contrib-0.11.0.jar ;
Added /usr/local/hive-0.11.0/lib/hive-contrib-0.11.0.jar to class path
Added resource: /usr/local/hive-0.11.0/lib/hive-contrib-0.11.0.jar

Count the total number of log lines:

hive> select count(*) from test;
...
OK
1099483
Time taken: 27.846 seconds, Fetched: 1 row(s)

List the 10 most requested URLs:

hive> select request,count(request) as numrequest from test group by request sort by numrequest desc limit 10;
OK
/forum-103-1.html 17004
/forum-103-2.html 6023
/forum-103-3.html 2395
/forum-103-4.html 1380
/ 1375
/forum-103-5.html 1093
/forum-103-6.html 840
/cee/index.php/comment/news/more/5347782 797
/csdf/index.php/comment/news/more/5348055 788
/member.php?mod=logging&action=login 661
Time taken: 69.206 seconds, Fetched: 10 row(s)
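
For a small extract of the log, the same group-and-count logic can be cross-checked outside Hive. The sketch below is a plain Python equivalent of the query's idea, not how Hive executes it; top_requests is a made-up name and the sample paths are invented purely for illustration.

```python
from collections import Counter


def top_requests(requests, n=10):
    """Count each request path and return the n most frequent as (path, count)."""
    counts = Counter(requests)
    return counts.most_common(n)


# Invented sample data, not taken from the real log.
sample = ["/a", "/b", "/a", "/c", "/a", "/b"]
print(top_requests(sample, 2))
```

On the invented sample this prints [('/a', 3), ('/b', 2)], the same shape as the request and numrequest columns above.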

2013-10-10 11:25