Research on a Focused Crawler Based on Network Topology
Abstract
A focused crawler is topic-oriented: it filters out irrelevant links and fetches only relevant web pages. General-purpose focused crawlers, however, cannot handle redundant links, so this paper proposes a focused crawler based on network topology. The crawler obtains an initial set of pages from a search engine and computes text similarity with the vector space model. It first performs link analysis on the extracted URLs, then adjusts each URL's weight according to the characteristics of scale-free networks. It also feeds back irrelevant topic regions, and sets the buffer length for irrelevant URLs according to the distance between a URL and the seed set. Simulation results show that the topology-based crawler achieves higher precision than a general crawler.
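The text-similarity step mentioned in the abstract can be illustrated with a minimal sketch of cosine similarity in a vector space model. The abstract does not specify the term weighting or tokenization, so raw term frequencies over whitespace-separated tokens are assumptions here, and the function names `term_vector` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a sparse term-frequency vector (assumed weighting: raw counts)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as unrelated
    return dot / (norm_a * norm_b)


# Illustrative inputs, not data from the paper.
topic = term_vector("focused crawler link analysis")
page_relevant = term_vector("a focused crawler uses link analysis to rank links")
page_irrelevant = term_vector("recipes for summer salads")
```

A crawler of this kind would compare each fetched page against the topic vector and prioritize links found on pages with higher scores. How the paper combines this score with its link-analysis and scale-free weighting is not stated in the abstract.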
Keywords
Focused Crawler; Link Analysis; Scale-free Network; Vector Space
Citation Format:
熊菲 (Fei Xiong), 劉雲 (Yun Liu), 李勇 (Yong Li), "基於網路拓撲的聚焦爬蟲研究 [Research on a Focused Crawler Based on Network Topology]," Journal of Internet Technology, vol. 9, no. 5, pp. 377-380, Dec. 2008.
Published by Executive Committee, Taiwan Academic Network, Ministry of Education, Taipei, Taiwan, R.O.C
JIT Editorial Office, Office of Library and Information Services, National Dong Hwa University
No. 1, Sec. 2, Da Hsueh Rd., Shoufeng, Hualien 974301, Taiwan, R.O.C.
Tel: +886-3-931-7314 E-mail: jit.editorial@gmail.com