Research on Algorithm for Informational Text De-Duplication_Vol. 1 No. 2 (JBDC 2023)_Journal of Big Data and Computing (ISSN: 2959-0590)


Research on Algorithm for Informational Text De-Duplication

DOI: https://doi.org/10.62517/jbdc.202301205

Author(s)

Linqing Deng1, Yingying Li1,2, Jie Hu1

Affiliation(s)

1 School of Software, Shanxi Agricultural University, Jinzhong 030801, China
2 Mixed and Virtual Reality Research Lab, Vicubelab, Faculty of Engineering, Universiti Teknologi Malaysia (UTM), Johor Bahru, Johor, Malaysia

Abstract

In recent years, as science and technology in China, and computer science in particular, have continued to develop, channels that transmit information as Chinese text and short text, such as Micro-blog and WeChat official accounts, have grown and become widespread. The steadily increasing volume of short text provides rich resources for information-based decision-making, but it also carries a large amount of redundancy, in particular invalid and repeated content in informational texts. Such a large, repetitive information set occupies a great deal of system storage and hinders the collection and extraction of useful information and data from informational texts, which seriously reduces both the accuracy of decision-making and the timeliness of information. It is therefore necessary to strengthen research on methods for informational text de-duplication. Taking informational text de-duplication as an example, this paper reviews current research on text de-duplication technology in China and abroad and, building on the relevant technologies, studies methods for informational text de-duplication, in order to provide reference ideas for enterprises carrying out informational text de-duplication.

Keywords

Informational Text; De-Duplication Algorithm; De-Duplication Technology
