A Study of NLPDedup Efficiency for Small and Large Datasets

Hinata Yokoyama; Kazuma Iwamoto; Ichitoshi Takehara; Kazuaki Ando; Hitoshi Kamei

doi:10.52731/ijskm.v9.i1.891

Hinata Yokoyama, Kagawa University
Kazuma Iwamoto, Kagawa University
Ichitoshi Takehara, Kagawa University
Kazuaki Ando, Kagawa University
Hitoshi Kamei, Kagawa University

Keywords: Data deduplication, Natural Language Processing, Levenshtein distance, N-gram

Abstract

The amount of stored data has grown rapidly in recent years. Deduplication functions reduce it by finding and deleting redundant data. However, checking for redundancy degrades performance, because these functions issue many read I/Os and compare data. Narrowing down the files to be processed is an effective way to mitigate this penalty. Conventional methods use file metadata and hash values generated from file contents as indicators. However, when a file system stores many files, these methods become inefficient because checking metadata and calculating hash values impose a high load. We propose a novel method, called NLPDedup, that uses natural language processing to narrow down the files processed by data deduplication functions. NLPDedup uses file names as indicators for narrowing down target files. This paper gives an overview of NLPDedup, describes how it determines the target files, and presents evaluation results for small and large datasets. The results show that the thresholds of the NLPDedup indicators need to be set according to the natural language processing algorithm and the dataset. We found that NLPDedup is effective on both datasets and becomes more effective when appropriate thresholds are set.
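
As an illustration of narrowing down files by name, the following minimal Python sketch scores pairs of file names with two measures named in the keywords, Levenshtein distance and character N-grams, and keeps only the pairs whose score reaches a threshold. It is not the NLPDedup implementation: the function names, the normalization of the edit distance, the use of Jaccard similarity over N-gram sets, the sample file names, and the threshold value are all assumptions made for this example.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical names."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity of the character n-gram sets of a and b."""
    if a == b:
        return 1.0
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a or not grams_b:
        return 0.0  # a name shorter than n has no n-grams to compare
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def candidate_pairs(names, similarity, threshold):
    """Return the name pairs whose similarity reaches the threshold.

    Only these pairs would be handed to the (expensive) content comparison.
    """
    return [(x, y) for x, y in combinations(names, 2)
            if similarity(x, y) >= threshold]


if __name__ == "__main__":
    # Hypothetical file names and threshold, chosen only for illustration.
    names = ["report_2023.csv", "report_2024.csv", "budget.xlsx"]
    print(candidate_pairs(names, levenshtein_similarity, threshold=0.8))
```

Because the two measures produce scores on different scales for the same pair of names, a threshold that suits one measure will generally not suit the other, which is consistent with the observation that thresholds must be chosen per algorithm and per dataset.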
