The key issue in Web data mining is how to design an intelligent and efficient spider. This paper analyzes in detail the workflow and key technologies of a URL-oriented spider. We propose managing the URL list with several queues and, to download HTML files at high speed, sorting the URLs by document relevance. Moreover, we introduce an iterative threshold into the computation of document relevance, which avoids having to adjust the threshold arbitrarily.
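The abstract does not give the details of the queue layout or of the threshold iteration, so the following is only a minimal sketch of how the two ideas might fit together. It assumes that each URL already has a document relevance (correlativity) score in [0, 1], uses the common iterative threshold selection scheme (start from the mean, then move to the midpoint of the two group means), and assumes three structures: a priority queue of relevant URLs, a FIFO of deferred URLs, and a visited set. The names `iterative_threshold` and `UrlFrontier` are hypothetical and are not taken from the paper.

```python
import heapq
from collections import deque


def iterative_threshold(scores, tol=1e-6, max_iter=100):
    """Pick a threshold by iteration instead of by hand (assumed scheme).

    Start from the mean score, split the scores into a high and a low
    group, and move the threshold to the midpoint of the two group means
    until it stops changing.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    t = sum(scores) / len(scores)
    for _ in range(max_iter):
        high = [s for s in scores if s > t]
        low = [s for s in scores if s <= t]
        if not high or not low:
            break  # all scores fall on one side; keep the current threshold
        new_t = (sum(high) / len(high) + sum(low) / len(low)) / 2
        if abs(new_t - t) < tol:
            return new_t
        t = new_t
    return t


class UrlFrontier:
    """Hypothetical multi-queue URL manager for a focused spider."""

    def __init__(self):
        self._pending = []        # heap of (-score, insertion order, url)
        self._deferred = deque()  # low-relevance URLs, fetched last
        self._seen = set()        # every URL ever added, to avoid repeats
        self._order = 0

    def add(self, url, score, threshold):
        """Queue a URL once; return False if it was already known."""
        if url in self._seen:
            return False
        self._seen.add(url)
        if score >= threshold:
            heapq.heappush(self._pending, (-score, self._order, url))
            self._order += 1
        else:
            self._deferred.append(url)
        return True

    def next_url(self):
        """Return the most relevant pending URL, then deferred ones, else None."""
        if self._pending:
            return heapq.heappop(self._pending)[2]
        if self._deferred:
            return self._deferred.popleft()
        return None
```

In this sketch the threshold is recomputed from the observed scores rather than set by hand, and the spider always fetches the highest-scoring pending URL first; how the paper itself scores documents and distributes URLs among its queues may differ.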
Cite this article
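Zhang Guoping; Wan Zhongbao; Liu Gaoyuan. Design and Implementation of an Information Publishing System Based on a Lightweight J2EE Framework[J]. Journal of East China Jiaotong University, 2007, 24(1): 71-75. Research and Realization of a Spider Model Facing URL[J]. Journal of East China Jiaotong University, 2007, 24(1): 71-75.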