A Novel Parallel Method for Denoising and Deduplicating Mass Web Documents

Jin Liu,
Jun-Jie Song,
Lei Kong,
Jeong-Uk Kim,
Jin Wang

Abstract


With the rapid growth of web applications, search engines accumulate large amounts of duplicate data in their cloud archiving systems, which reduces data processing efficiency. Noise reduction and deduplication can optimize storage to meet the growing demand for mass data, especially for web search engines. This paper proposes a novel parallel noise reduction and deduplication method for web documents (PDNR). The method combines the Map-Reduce computing model, frequency statistics, and a technique for deleting duplicate web pages using a hash table and an inverted index. PDNR uses four Map-Reduce tasks to distribute data, reduce noise, and remove duplicates. An experimental system was developed on the Hadoop platform and evaluated on real-world web pages. The results show that PDNR can remove duplicate web pages efficiently and accurately.
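
The abstract names the building blocks (frequency statistics, a hash table, an inverted index) but not their details, so the following is only a minimal single-process sketch of how such pieces might fit together, not the PDNR algorithm itself. Everything specific in it is an assumption made for illustration: the function names `denoise`, `fingerprint`, and `deduplicate`, the rule that a line appearing in more than half of all pages counts as noise, and the use of an MD5 digest of normalized text as the hash key. In a Map-Reduce setting, the counting and grouping loops would correspond to map and reduce phases.

```python
import hashlib
from collections import Counter, defaultdict


def denoise(pages, max_share=0.5):
    """Drop lines that occur in more than max_share of all pages.

    Assumed noise rule: text repeated across many pages (menus, footers)
    is treated as noise. The paper's actual rule may differ.
    """
    line_counts = Counter()
    for text in pages.values():
        # Count each distinct line once per page (page frequency).
        line_counts.update(set(text.splitlines()))
    limit = max_share * len(pages)
    return {
        doc_id: "\n".join(
            line for line in text.splitlines() if line_counts[line] <= limit
        )
        for doc_id, text in pages.items()
    }


def fingerprint(text):
    """Hash of case-folded, whitespace-normalized text (assumed hash key)."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def deduplicate(pages):
    """Group pages by fingerprint and keep the first page of each group."""
    # Inverted index: fingerprint -> ids of the pages that share it.
    index = defaultdict(list)
    for doc_id in sorted(pages):
        index[fingerprint(pages[doc_id])].append(doc_id)
    unique = {ids[0]: pages[ids[0]] for ids in index.values()}
    duplicates = {fp: ids for fp, ids in index.items() if len(ids) > 1}
    return unique, duplicates
```

Calling `deduplicate(denoise(pages))` on a dictionary that maps page ids to page text returns the retained pages and, for each fingerprint shared by more than one page, the ids in that group. An exact-match hash of this kind only catches pages that are identical after normalization; near-duplicates would need a different fingerprinting scheme.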

Keywords


Web pages; Noise reduction; Deduplication

Citation Format:
Jin Liu, Jun-Jie Song, Lei Kong, Jeong-Uk Kim, Jin Wang, "A Novel Parallel Method for Denoising and Deduplicating Mass Web Documents," Journal of Internet Technology, vol. 17, no. 5, pp. 889-896, Sep. 2016.

Published by Executive Committee, Taiwan Academic Network, Ministry of Education, Taipei, Taiwan, R.O.C
JIT Editorial Office, Office of Library and Information Services, National Dong Hwa University
No. 1, Sec. 2, Da Hsueh Rd., Shoufeng, Hualien 974301, Taiwan, R.O.C.
Tel: +886-3-931-7314  E-mail: jit.editorial@gmail.com