Issue No.11 - Nov. (2012 vol.61)
pp: 1651-1664
H. Howie Huang , The George Washington University, Washington, DC
Nan Zhang , The George Washington University, Washington, DC
Wei Wang , University of Delaware, Newark
Gautam Das , University of Texas at Arlington, Arlington
Alexander S. Szalay , Johns Hopkins University, Baltimore
ABSTRACT
As file systems reach the petabyte scale, users and administrators are increasingly interested in acquiring high-level analytical information for file management and analysis. Two particularly important tasks are the processing of aggregate and top-k queries, which, unfortunately, cannot be answered quickly by hierarchical file systems such as ext3 and NTFS. Existing preprocessing-based solutions, e.g., file system crawling and index building, consume a significant amount of time and space (for generating and maintaining the indexes), which in many cases cannot be justified by the infrequent usage of such solutions. In this paper, we advocate that user interests can often be sufficiently satisfied by approximate (i.e., statistically accurate) answers. We develop Glance, a just-in-time sampling-based system which, after consuming a small number of disk accesses, is capable of producing extremely accurate answers for a broad class of aggregate and top-k queries over a file system without requiring any prior knowledge. We use a number of real-world file systems to demonstrate the efficiency, accuracy, and scalability of Glance.
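To make the idea of a sampling-based aggregate estimate concrete, the following is a minimal sketch of one possible approach: a random descent through the directory tree that estimates the total number of files. The abstract does not specify Glance's algorithm, so the estimator, the function names `one_descent` and `estimate_file_count`, and the parameters `descents` and `seed` are all assumptions made for illustration, not the authors' implementation.

```python
import os
import random


def one_descent(root, rng):
    """One random root-to-leaf walk; returns an unbiased estimate of the
    number of regular files under root (illustrative, hypothetical helper)."""
    estimate = 0.0
    prob = 1.0  # probability that this walk reaches the current directory
    current = root
    while True:
        files, subdirs = 0, []
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    subdirs.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    files += 1
        # Weight the files seen here by the inverse of the chance of
        # visiting this directory, so the expectation equals the true total.
        estimate += files / prob
        if not subdirs:
            return estimate
        # Pick one subdirectory uniformly at random and keep descending.
        current = rng.choice(subdirs)
        prob /= len(subdirs)


def estimate_file_count(root, descents=100, seed=None):
    """Average several independent descents to reduce variance."""
    rng = random.Random(seed)
    return sum(one_descent(root, rng) for _ in range(descents)) / descents
```

Each descent reads only the directories along a single path, so the number of disk accesses depends on the tree depth and the number of descents rather than on the total number of files. On skewed trees the variance of this simple estimator can be large, and more descents are needed for a stable answer.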
INDEX TERMS
Estimation, Aggregates, Indexes, Accuracy, Query processing, Calculators, History, File systems, Data analytics
CITATION
H. Howie Huang, Nan Zhang, Wei Wang, Gautam Das, Alexander S. Szalay, "Just-in-Time Analytics on Large File Systems", IEEE Transactions on Computers, vol.61, no. 11, pp. 1651-1664, Nov. 2012, doi:10.1109/TC.2011.186