Detection of Unnatural Parts of Statistical Data

  • Tetsuya Nakatoh Nakamura Gakuen University http://orcid.org/0000-0001-9299-7757
  • Takahiko Suzuki Kyushu University
  • Tsukasa Kamimasu Kyushu University
  • Sachio Hirokawa Advanced Institute of Industrial Technology

Abstract

Ensuring the authenticity of statistical data is important because such data are used for various decision-making tasks. In practice, however, several types of data alteration have been reported, so the accuracy of statistical data needs to be validated. Benford's law, which states that the first significant digits of numerical data follow a particular distribution, is a well-known basis for detecting unnatural numerical data. However, it cannot accurately identify which parts of the data are unnatural. In this study, we attempted to identify the unnatural parts of statistical data available in tabular format. A subset of the target data was specified using the row and column names that define each cell in the table or the words displayed in the table title. By measuring the divergence of each subset, we identified the unnatural subsets. We present the results of identifying unnatural subsets in agricultural data acquired from the China Statistical Yearbook.
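The idea of scoring subsets by their divergence can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the divergence measure, so Kullback-Leibler divergence from the Benford distribution is assumed here, and the names `first_digit`, `benford_divergence`, and `rank_subsets` are hypothetical. Subsets are formed only from row and column names; subsets defined by words in the table title are omitted.

```python
import math
from collections import Counter

# Benford's law: P(d) = log10(1 + 1/d) for first significant digit d = 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """Return the first significant digit of x, or None for 0, NaN, or infinity."""
    x = abs(float(x))
    if x == 0 or math.isnan(x) or math.isinf(x):
        return None
    # In scientific notation the first character is the first significant digit.
    return int(f"{x:.17e}"[0])


def benford_divergence(values):
    """KL divergence D(observed || Benford) of first digits (assumed measure)."""
    digits = [d for d in map(first_digit, values) if d is not None]
    n = len(digits)
    if n == 0:
        return None
    counts = Counter(digits)
    # Digits that never occur contribute zero to the sum and are skipped.
    return sum((c / n) * math.log((c / n) / BENFORD[d]) for d, c in counts.items())


def rank_subsets(cells):
    """Rank row and column subsets by divergence, largest first.

    cells: iterable of (row_name, column_name, value) tuples (hypothetical layout).
    """
    subsets = {}
    for row, col, value in cells:
        subsets.setdefault(("row", row), []).append(value)
        subsets.setdefault(("column", col), []).append(value)
    scores = {key: benford_divergence(vals) for key, vals in subsets.items()}
    scored = [(key, s) for key, s in scores.items() if s is not None]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Subsets at the top of the ranking deviate most from the Benford distribution and are candidates for closer inspection. Small subsets give noisy estimates, so a practical version would likely need a minimum subset size.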

Published
2020-12-30
Section
Technical Papers (Data Science & Institutional Research)