60.28.204.0 Zhuaxia
61.135.163.0 Baidu
61.135.216.0 Youdao
65.55.106.0 Microsoft
65.55.207.0 Microsoft
65.55.211.0 Microsoft
66.249.66.0 Google
72.14.199.0 Google
121.0.29.0 Alibaba
123.125.66.0 Baidu
124.115.10.0 Tencent Soso
124.115.11.0 Tencent Soso
124.115.12.0 Tencent Soso
203.208.60.0 Google
209.85.238.0 Google
219.239.34.0 Xianguo
220.181.50.0 Baidu
220.181.61.0 Sogou
Finally, we also need an IP address database. For the crawlers we have caught, we still have to verify their identity: is this a malicious crawler, or a legitimate one that simply has not been added to our whitelist yet? An IP address database is easy to download from the Internet, so we will not go into that here. With these pieces in place, identifying crawlers becomes very simple; a dozen or so lines of Ruby code take care of it:

Ruby code:
# Whitelisted crawler IP segments: first column of whitelist.txt
whitelist = []
IO.foreach("#{RAILS_ROOT}/lib/whitelist.txt") { |line| whitelist << line.split[0].strip if line }

# IPs already recorded as real visitors (visit_ip.log, one IP per line)
realiplist = []
IO.foreach("#{RAILS_ROOT}/log/visit_ip.log") { |line| realiplist << line.strip if line }

# Flag IPs with more than 3000 requests (stat_ip.log: count, then IP)
# that are neither whitelisted crawlers nor real visitors
iplist = []
IO.foreach("#{RAILS_ROOT}/log/stat_ip.log") do |line|
  ip = line.split[1].strip
  iplist << ip if line.split[0].to_i > 3000 && !whitelist.include?(ip) && !realiplist.include?(ip)
end

# E-mail the list of suspect crawler IPs
Report.deliver_crawler(iplist)
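
The identity check against the IP address database can also be approximated without any downloaded data. As a minimal sketch (not part of the original script; the domain list and the classify_crawler helper below are purely illustrative), a reverse DNS lookup with Ruby's standard Resolv library tells us who operates a suspect IP: well-known crawlers such as Googlebot and Baiduspider resolve to hostnames under their operators' domains, while most malicious crawlers come from generic hosting ranges or have no reverse record at all.

Ruby code:
require 'resolv'

# Domains of known legitimate crawlers (illustrative, not exhaustive)
KNOWN_CRAWLER_DOMAINS = %w(googlebot.com crawl.baidu.com sogou.com search.msn.com)

def classify_crawler(ip)
  host = Resolv.getname(ip)   # reverse DNS lookup, raises if there is no PTR record
  if KNOWN_CRAWLER_DOMAINS.any? { |domain| host.end_with?(domain) }
    "legitimate crawler (#{host}) - candidate for the whitelist"
  else
    "unidentified host (#{host}) - possibly malicious"
  end
rescue Resolv::ResolvError
  "no reverse DNS record - possibly malicious"
end

iplist.each { |ip| puts "#{ip}: #{classify_crawler(ip)}" }

For a stricter check one would also forward-resolve the returned hostname and confirm it maps back to the same IP, since PTR records can be forged.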