AWStats: A Log Analysis Tool for Apache/IIS. A Brief Guide to Using It on GNU/Linux and Windows (6)
---
> my $hr=($ix+1); if ($hr>12) { $hr=$hr-12; }
Since AWStats 5.5, definitions for the major Chinese search engines have been built in. Below is the complete list after my additions, covering both the search pages of the major portals and the dedicated search sites. Each hunk patches one of three tables: the list of host patterns, the query parameter that carries the keyword, and the display name.
62c60
< "baidu\.com","search\.sina\.com","search\.sohu\.com",
---
> "baidu\.com","sina\.com","3721\.com","163\.com","tom\.com","sohu\.com",
153c144
< "baidu\.com","Word=", "search\.sina\.com", "word=", "search\.sohu\.com","word=",
---
> "baidu\.com","word=", "sina\.com", "word=", "3721\.com", "name=","163\.com","q=","tom\.com","word=","sohu\.com","word=",
250c234
< "baidu\.com","Baidu", "search\.sina\.com","Sina", "search\.sohu\.com","Sohu",
---
> "baidu\.com","Baidu", "sina\.com","Sina", "3721\.com","3721","163\.com","NetEase","tom\.com","Tom","sohu\.com","Sohu",
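To show how the three patched lists fit together, here is a minimal Perl sketch of a referrer lookup. It is an illustration only, not AWStats's actual code: the names `@engine_order`, `%engine_param`, `%engine_name` and `match_engine` are hypothetical, and the sample URLs are made up.

```perl
use strict;
use warnings;

# Hypothetical tables mirroring the three patched lists
# (these are NOT the real AWStats variable names).
my @engine_order = ('baidu\.com', 'sina\.com', '3721\.com',
                    '163\.com', 'tom\.com', 'sohu\.com');
my %engine_param = (
    'baidu\.com' => 'word=', 'sina\.com' => 'word=', '3721\.com' => 'name=',
    '163\.com'   => 'q=',    'tom\.com'  => 'word=', 'sohu\.com' => 'word=',
);
my %engine_name = (
    'baidu\.com' => 'Baidu',   'sina\.com' => 'Sina', '3721\.com' => '3721',
    '163\.com'   => 'NetEase', 'tom\.com'  => 'Tom',  'sohu\.com' => 'Sohu',
);

# Return (display name, raw keyword) for a referrer URL,
# or an empty list when no known engine matches.
sub match_engine {
    my ($referrer) = @_;
    my ($host, $query) =
        $referrer =~ m{^https?://([^/?#]+)[^?#]*(?:\?([^#]*))?};
    return unless defined $host;

    for my $pattern (@engine_order) {
        next unless $host =~ /$pattern/i;
        my $param = $engine_param{$pattern};
        # Look for the query parameter that carries the keyword.
        for my $pair (split /&/, defined $query ? $query : '') {
            return ($engine_name{$pattern}, substr($pair, length $param))
                if index($pair, $param) == 0;
        }
        return ($engine_name{$pattern}, undef);   # engine known, no keyword
    }
    return;
}

my ($name, $keyword) = match_engine('http://www.baidu.com/s?word=linux&cl=3');
print "$name: $keyword\n";
```

The keyword comes back still percent-encoded. The parameter match is case-sensitive in this sketch, so the `Word=` versus `word=` difference in the second hunk would matter here.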
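Unicode queries from Google still need a patch. Internet Explorer on Windows 2000 and later sends queries to Google in UTF-8 by default, while most other search engines use the system's local encoding, GB2312. So after URI-decoding the query, you also have to check whether it is UTF-8 and, if so, convert it to GB2312. Otherwise the same keyword ends up as two separate records in the statistics, one in UTF-8 and one in GB2312.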
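The normalization described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patch itself: `normalize_keyword` is a hypothetical name, only the core `Encode` module is used, and `euc-cn` is taken as the byte encoding of GB2312.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: map a percent-encoded keyword to GB2312 bytes,
# whether it was originally sent as UTF-8 or as GB2312.
sub normalize_keyword {
    my ($raw) = @_;

    # Undo percent-encoding to recover the raw query bytes.
    (my $bytes = $raw) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

    # Strict decode: dies on malformed UTF-8, which eval turns into undef.
    my $chars = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };

    # Not valid UTF-8: assume the bytes are already GB2312 and keep them.
    return $bytes unless defined $chars;

    # Valid UTF-8: re-encode so both forms share one key.
    return encode('euc-cn', $chars);
}

# Percent-encode every byte of a string (helper for the demo only).
sub escape_all { join '', map { sprintf '%%%02x', ord($_) } split //, $_[0] }

my $word    = "\x{6728}\x{5b50}\x{7f8e}";              # sample keyword
my $as_utf8 = escape_all(encode('UTF-8',  $word));
my $as_gb   = escape_all(encode('euc-cn', $word));
print normalize_keyword($as_utf8) eq normalize_keyword($as_gb)
    ? "same key\n" : "different keys\n";
```

The "valid UTF-8" test is a heuristic. A short GB2312 byte sequence can happen to be well-formed UTF-8 and would then be converted by mistake. Characters with no GB2312 equivalent are replaced by a substitution character on re-encoding.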
I added the following function to decode Google's UTF-8 characters, as well as queries written in the form "\xc4\xbe\xd7\xd3\xc3\xc0":

# Relies on uri_unescape from URI::Escape and decode from Encode.
sub Utf8_To_Ascii {
    my $string   = shift;
    my $encoding = shift;

    # change \xc4\xbe\xd7\xd3\xc3\xc0 into %c4%be%d7%d3%c3%c0
    $string =~ s/\\x(\w{2})/%$1/gi;

    # uri unescape
    $string = uri_unescape($string);

    # decode only when the whole string is a well-formed UTF-8 byte sequence
    if ( $string =~ m/^([\x00-\x7f]|[\xc2-\xdf][\x80-\xbf]|\xe0[\xa0-\xbf][\x80-\xbf]|[\xe1-\xef][\x80-\xbf][\x80-\xbf]|\xf0[\x90-\xbf][\x80-\xbf][\x80-\xbf]|[\xf1-\xf7][\x80-\xbf][\x80-\xbf][\x80-\xbf])*$/ ) {
        $string = decode("utf-8", $string);