Automatic Fragment Detection in Dynamic Web Pages and Its Impact on Caching
June 2005 (vol. 17 no. 6)
pp. 859-874
Constructing Web pages from fragments has been shown to provide significant benefits for both content generation and caching. For a Web site to use fragment-based content generation, however, good methods are needed for fragmenting the Web pages. Manual fragmentation of Web pages is expensive, error-prone, and unscalable. This paper proposes a novel scheme to automatically detect and flag fragments that are cost-effective cache units in Web sites serving dynamic content. Our approach analyzes Web pages with respect to their information-sharing behavior, personalization characteristics, and change patterns. We identify fragments that are shared among multiple documents or that have different lifetime or personalization characteristics. Our approach has three unique features. First, we propose a framework for fragment detection, which includes a hierarchical and fragment-aware model for dynamic Web pages and a compact and effective data structure for fragment detection. Second, we present an efficient algorithm to detect maximal fragments that are shared among multiple documents. Third, we develop a practical algorithm that effectively detects fragments based on their lifetime and personalization characteristics. We apply the algorithms to real Web sites and evaluate the proposed scheme through a series of experiments, showing the benefits and costs of the algorithms. We also study the impact of using the fragments detected by our system on key parameters such as disk space utilization, network bandwidth consumption, and load on the origin servers.
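
To make the idea of "maximal fragments shared among multiple documents" concrete, here is a minimal sketch in Python. It is not the paper's algorithm or data structure, which the abstract does not specify. It assumes that each page is already parsed into a tree of `(tag, children)` tuples with text as plain strings, and that two subtrees count as the same fragment only if they match exactly. The names `index_subtrees` and `detect_shared_fragments` and the `min_pages` and `min_size` thresholds are hypothetical. Detection based on lifetime and personalization characteristics is not covered.

```python
import hashlib
from collections import defaultdict


def index_subtrees(node, page_id, index):
    """Record every subtree of `node` in `index`; return (signature, text)."""
    if isinstance(node, str):
        text, child_sigs = node, []
    else:
        tag, children = node
        parts = [index_subtrees(child, page_id, index) for child in children]
        child_sigs = [sig for sig, _ in parts]
        text = "<%s>%s</%s>" % (tag, "".join(t for _, t in parts), tag)
    # Assumption: exact-match hashing of the serialized subtree.
    sig = hashlib.sha1(text.encode("utf-8")).hexdigest()
    entry = index[sig]
    entry["pages"].add(page_id)
    entry["text"] = text
    for child_sig in child_sigs:
        index[child_sig]["parents"].add(sig)
    return sig, text


def detect_shared_fragments(pages, min_pages=2, min_size=20):
    """Return (text, page_ids) for maximal subtrees shared by >= min_pages pages."""
    index = defaultdict(lambda: {"pages": set(), "parents": set(), "text": ""})
    for page_id, root in pages.items():
        root_sig, _ = index_subtrees(root, page_id, index)
        index[root_sig]["parents"].add(None)  # None marks "no enclosing subtree"

    def shared(sig):
        return sig is not None and len(index[sig]["pages"]) >= min_pages

    fragments = []
    for sig, entry in index.items():
        if not shared(sig) or len(entry["text"]) < min_size:
            continue
        # Maximal: at least one enclosing subtree is not itself shared.
        if any(not shared(parent) for parent in entry["parents"]):
            fragments.append((entry["text"], sorted(entry["pages"])))
    return sorted(fragments)


# Two hypothetical pages that share a navigation bar but differ in the story.
nav = ("div", [("a", ["Home"]), ("a", ["News"]), ("a", ["Sports"])])
pages = {
    "page1": ("body", [nav, ("p", ["Story one about the election."])]),
    "page2": ("body", [nav, ("p", ["Story two about the weather."])]),
}
fragments = detect_shared_fragments(pages)
```

In this example only the navigation `div` is reported. Its individual links are also shared, but they sit inside a larger shared subtree, so they are not maximal. Real pages rarely match byte for byte, so a practical detector would likely need approximate matching in place of the exact hash assumed here.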


Index Terms:
Dynamic content caching, fragment-based caching, fragment detection.
Lakshmish Ramaswamy, Arun Iyengar, Ling Liu, Fred Douglis, "Automatic Fragment Detection in Dynamic Web Pages and Its Impact on Caching," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, pp. 859-874, June 2005, doi:10.1109/TKDE.2005.89