Document Type : Research Paper
Authors
1 Master's degree in Artificial Intelligence, Faculty of Electrical and Computer Engineering, Malek Ashtar University of Technology, Tehran, Iran
2 Assistant Professor, Faculty of Electrical and Computer Engineering, Malek Ashtar University of Technology, Tehran, Iran
Abstract
Keywords
Note: Figures, tables, formulas, and diagrams are not displayed in the HTML version of this article. For the complete and properly formatted content, please download and refer to the PDF version.
Constructing a semantic network requires the integration of multiple components. The approach employed in this paper focuses on the automatic construction of a semantic network. Modern approaches to computerized Qur’an mining were established in the second half of the 20th century by Muslim scholars, in parallel with advancements in computational processing equipment. In the early years of the 21st century, as natural language processing and semantic computing methods flourished, these approaches gained greater coherence and diversity. From 2010 onward, with the emergence of novel artificial intelligence techniques such as deep neural networks, these methods were increasingly employed to further expand this research domain. Following the widespread introduction and success of large language models (LLMs) between 2017 and 2021, and continuing to the present, researchers have adopted methods based on this computational infrastructure—such as machine comprehension of texts, automatic machine tagging, and prompt engineering techniques—within the field of computerized Qur’an mining.
The knowledge of Qur’anic exegesis encompasses a substantial corpus of profound insights and wisdom preserved in interpretive texts. Semantic—and even cognitive—processing of this valuable corpus can lead to significant advancements in understanding the Holy Qur’an through computerized Qur’an mining. To achieve this objective, robust semantic processing infrastructures must be applied to existing exegetical texts, whether through human, semi-automatic, or fully automatic methods. Among the most important of these semantic infrastructures are semantic networks and ontologies. In the present paper, we propose a multifaceted solution for the automatic generation of an abductive semantic network for the Holy Qur’an, employing a hybrid engineering approach. This solution integrates advances from information engineering, information retrieval, natural language processing, word embedding systems, and semantic-cognitive computing.
Ontologies serve as formal and explicit representations of conceptual structures and are essential tools in semantic networks. An ontology comprises several components, including concepts (classes) (C), relations (properties) (R), instances (I), axioms (A), data types (T), and values (V) (Gruber 1993). Ontology relations are divided into two main categories: taxonomic relations, which cover hypernymy and hyponymy (is-a hierarchies), and non-taxonomic relations, which encompass several subcategories, including part-of, antonymy, synonymy, possession, and causality.
In the Persian language, a number of ontologies have been developed. Some of these are domain-specific, as outlined below:
The table below (Table 1) lists several prominent studies on computational ontology development for the Holy Qur’an conducted between 2023 and 2025. These studies emphasize ontology engineering, ontology modeling, text mining, and natural language processing. They originate from universities and research institutes in Germany, Indonesia, Egypt, Turkey, Iraq, Pakistan, Iran, China, Malaysia, Japan, and the UK, demonstrating the global scope of research efforts in this field. Specific references corresponding to each study title are provided in the references section.
Table 1. Several prominent studies on computational ontology development for the Qur’an (2023–2025)
In the Amid Dictionary (2010), three meanings are provided for the term tadāʿī (association/abduction): (a) the principle or state in which thoughts, ideas, emotions, and experiences become interconnected, such that they emerge sequentially in the mind; a chain of thoughts or notions; (b) calling upon one another and gathering together; (c) recalling or remembering. Moreover, tadāʿī is defined as calling upon one another, and in psychology it refers to the relationship between a phenomenon and its associated thoughts (Pournamdarian & Tehrani Sabet 2010). Lobo describes abduction as a process for explaining the observable influences of phenomena in the universe (Lobo & Uzcátegui 1997). Ernest defines abductive relations as the chain of antecedents and consequents of an expression (Ernest 2023).
As is evident from the meaning of the term “abduction” (tadāʿī), the intended sense in this paper is the recalling of a word as a result of seeing or hearing another word. In English, two equivalents exist for tadāʿī: in the context of logic, the term “abduction” is used, whereas in the context of psychology, the term “association” is employed.
Numerous methods have been employed to construct or develop ontologies, semantic networks, and associative structures. Mohammadi and Badie, in their article, proposed a method for extracting concept chains and assigning scores to them. First, the text is segmented and semantically parsed. Then, concept chains are extracted. Finally, key concepts are identified using the assigned scores and a predefined threshold (Mohammadi & Badie 2017). Ahmadi et al. (2017) utilized a lexical co-occurrence method to extract concept hierarchies in the field of scientometrics. Mousavi et al. (2017) presented a method for constructing a Persian ontology by establishing links between Persian words and PWN (Princeton WordNet). This WordNet contains 16,000 words and 22,000 synsets. The accuracy reported in this study was 91.18%. Mousavi and Faili (2021) improved upon their previous work by adding Persian compound verbs to the existing ontology. The method employed in this study also relied on supervised learning.
Humans typically present their observations based on their background knowledge. This process resembles abduction more closely than deductive reasoning, as it requires assumptions that are not explicitly present in the observation itself. Humans possess the ability to comprehend complex situations, a capability that does not necessarily arise directly from observation (Langley & Meadows 2019). Al-Salhi and Abdulla (2022) introduced a method based on domain-specific language mapping for the automatic construction of a Qur’anic ontology. Ghayoomi (2019) proposed a method for automatically determining word meanings based on word-embedding vectors. For each target word, two vectors were constructed: one representing the word itself and another representing the contextual environment in which the word appears. Soliman et al. (2017) presented a trained word-embedding model using a dataset that included Twitter data, web data, and Wikipedia content. The technique employed in this study was Word2Vec.
In this section, after examining the scope and limitations of prior approaches, the proposed technical solution is introduced. The rationale underlying the engineering decisions made during the development of the proposed solution, which constitutes a significant portion of this paper's technical contribution from a method-engineering perspective, is explained at each step of the processing pipeline.
Figure 1 illustrates the overall workflow of the proposed approach. First, the text of Tafsir Nūr is converted into a structured dataset, and its constituent words are extracted. Then, TF-IDF is applied, and a threshold is used to select the final set of relevant words. In the next stage, and in parallel for these selected words, a co-occurrence matrix, word-embedding vectors, Persian FarsNet ontology relations, and connections to Arabic root extraction are obtained. Finally, based on the relations derived in the previous stage, abduction pairs and abduction frames are generated.
Figure 1. General Method of the Proposed Approach
This process (co-occurrence matrix, word-embedding vectors, Persian FarsNet ontological relations, connection to Arabic root extraction, and finally clustering) was conducted in three cases: words selected by the TF-IDF threshold, words selected with the aid of the thesaurus, and words retained after the normalization step.
To prepare the corpus, the Tafsir Nūr (Qara'ati 2004) was selected. The reason for using this Tafsir is its simple and fluent text, which makes the concepts of the Qur’an accessible to the general public. The source of the corpus is the Islamic Encyclopedia website (https://wiki.ahlolbait.com).
The Qur’an Comprehensive Database website (https://quran.inoor.ir) provides thematic categorization of verses. Using this site, verses related to the Hereafter were selected, comprising approximately 1,600 records.
After constructing the corpus from the Tafsir texts, preprocessing was performed in four stages: normalization, removal of extra punctuation marks, stop-word removal, and lemmatization, all using the Hazm library. After preprocessing, the number of words decreased from approximately 19,000 to 17,000. Subsequent processing stages were conducted separately for the three word categories. After removing low-frequency words and applying TF-IDF thresholding, the number of words was reduced to approximately 4,000.
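As an illustration of this selection step, TF-IDF thresholding can be sketched in plain Python on already-tokenized text. Hazm performs the real normalization and lemmatization; the function names and the max-over-documents scoring rule below are simplifying assumptions of this sketch, not the study's implementation:

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Compute TF-IDF for every term and keep each term's best score.

    docs: list of token lists (one list per tafsir passage).
    Returns: dict mapping term -> maximum TF-IDF over all documents."""
    n_docs = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    best = {}
    for doc in docs:
        tf = Counter(doc)
        for term, count in tf.items():
            score = (count / len(doc)) * math.log(n_docs / df[term])
            best[term] = max(best.get(term, 0.0), score)
    return best

def select_terms(docs, threshold):
    """Keep only terms whose best TF-IDF score reaches the threshold."""
    scores = tfidf_scores(docs)
    return {t for t, s in scores.items() if s >= threshold}
```

In the study, the threshold was tuned so that roughly 4,000 words survive; terms occurring in every document get an IDF of zero and drop out first.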
The co-occurrence matrix, a square matrix representing the probability of words appearing together within a window size of 5, was computed.
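A minimal sketch of such a windowed co-occurrence count in plain Python (normalizing the counts into joint probabilities is one of several possible conventions and is assumed here):

```python
from collections import defaultdict

def cooccurrence(tokens, window=5):
    """Count word pairs appearing within `window` tokens of each other,
    then normalise the counts into co-occurrence probabilities."""
    counts = defaultdict(int)
    for i, w in enumerate(tokens):
        # Look ahead up to `window` positions; pairs are unordered.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            counts[tuple(sorted((w, tokens[j])))] += 1
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()} if total else {}
```

The resulting dictionary is the sparse form of the square matrix; entries for word pairs that never co-occur are simply absent.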
The two main approaches in word embedding systems are CBOW and Skip-gram (Hinton 1986; Mikolov et al. 2013). Commonly used models include word2vec, GloVe, and FastText (Chawla 2018). This study utilized two pre-trained word embedding models: AminMozhgani and FarsiYar. The Mozhgani model (n.d.) is based on the word2vec word embedding trained on the Persian Wikipedia 2016 corpus (wikipedia_fa_all_nopic_2016-12.zim). Approximately 2000 words were extracted from this model. The FarsiYar (n.d.) corpus collection provides several services for Persian language processing (https://text-mining.ir/corpus). Among the multiple word embedding models available in this corpus, the GloVe model was selected. Approximately 2900 words were extracted from this model.
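Pre-trained vectors of this kind are commonly distributed as plain text, one word per line followed by its vector components; a hedged sketch of loading them (the exact on-disk layout of the two models is an assumption here):

```python
def load_vectors(lines, vocab=None):
    """Parse word-embedding vectors in the plain-text GloVe/word2vec
    style: each line holds a word followed by its float components.
    `vocab`, if given, restricts loading to the corpus words."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if vocab is not None and word not in vocab:
            continue
        vectors[word] = [float(v) for v in values]
    return vectors
```

Restricting to the corpus vocabulary is what yields the roughly 2,000 and 2,900 words extracted from the two models.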
After extracting the word embedding vectors (separately for each of the two models), cosine similarity measurement was performed. The meaningfulness of high-scoring word pairs indicated the validity of the proposed models (e.g., the word pair "Hazrat" and "Hassan" from the Mozhgani model, and the pair "adab" and "rusum" from the GloVe model had high scores).
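The similarity measurement itself is standard cosine similarity between embedding vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product of the vectors divided by the
    product of their Euclidean norms; 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

High-scoring pairs such as "adab" and "rusum" are those whose vectors point in nearly the same direction.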
FarsNet was selected as the best and most comprehensive Persian WordNet. This ontology offers several advantages over other Persian ontologies:
Using the FarsNet web-service, synsets exactly matching the words were retrieved and stored.
For each word, the corresponding synset ID, sense ID, synset text, noun category, verb past stem, verb present stem, verb type, semantic category, and part of speech were stored. Through matrix processing and utilization of the FarsNet ontology, relations among words in the Tafsir texts of the verses were derived (Figure 2).
Figure 2. Method of Extracting Relations Using the FarsNet Ontology
The text and wording of the Qur’an are highly sacred and meticulously structured. Many Qur’anic scholars believe that the Qur’anic text possesses linguistic inimitability. Accordingly, the Arabic Qur’anic text was also employed in this study.
This research proposes a method to establish connections between two texts in different languages that are related (not necessarily translations of each other). Here, the Arabic text of the Qur’an and the Persian text of Tafsir Nūr constitute these two texts. Since they are not direct translations, one-to-one mapping cannot be applied. Instead, this study proposes a TF-IDF-based method applied separately to both texts to computationally link them.
Using the Forqan corpus (Estiri et al. 2013), the roots of Arabic words in each verse were first extracted, followed by computation of TF-IDF scores for these roots across the verses (Figure 3).
Figure 3. Method of Calculating Relations between Words Based on Verse Word Roots
This relation is derived from the previously computed TF-IDF scores of roots and words:

\( V_{x,y} = \sum_{i=1}^{n} \mathrm{tfidf}(r_i, x) \cdot \mathrm{tfidf}(r_i, y) \)

\( n \): number of common roots between verses \(x\) and \(y\)
\( \mathrm{tfidf}(r_i, x) \): TF-IDF value for root \(r_i\) (the i-th common root) in verse \(x\)
\( \mathrm{tfidf}(r_i, y) \): TF-IDF value for root \(r_i\) (the i-th common root) in verse \(y\)
\( V_{x,y} \): entry for a verse pair \(x, y\) with at least three common roots (if \(x\) and \(y\) share fewer than three common roots, the entry is zero)
The verse-relation matrix is constructed from these root relations: it is a square matrix whose rows and columns correspond to the verses, with \(V_{i,j}\) as its entries. The abduction relation TRR between two Persian words is then computed as

\( \mathrm{TRR}(w_p, w_q) = \sum_{i} \sum_{j} t_{p,i} \cdot t_{q,j} \cdot V_{i,j} \)

\( w_p, w_q \): a pair of Persian words
\( t_{p,i} \): TF-IDF vector value of word \(w_p\) in the interpretation text of the i-th verse
\( t_{q,j} \): TF-IDF vector value of word \(w_q\) in the interpretation text of the j-th verse
\( V_{i,j} \): entry corresponding to verses \(i\) and \(j\) in the verse-relation matrix
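The two-step computation described above (verse-pair scores from shared Arabic roots, then word-pair TRR scores through the verse-relation matrix) can be sketched in plain Python; the function and variable names here are illustrative, not taken from the study's implementation:

```python
def verse_relation(roots_tfidf, x, y, min_common=3):
    """V_{x,y}: sum of products of TF-IDF scores of the roots common
    to verses x and y; zero if fewer than `min_common` roots are shared.
    roots_tfidf: dict mapping verse -> {root: tfidf}."""
    common = roots_tfidf[x].keys() & roots_tfidf[y].keys()
    if len(common) < min_common:
        return 0.0
    return sum(roots_tfidf[x][r] * roots_tfidf[y][r] for r in common)

def trr(word_tfidf, V, p, q):
    """TRR(w_p, w_q): couples the TF-IDF of word p in the tafsir of
    verse i with the TF-IDF of word q in the tafsir of verse j,
    weighted by the verse-relation entry V[i][j].
    word_tfidf: dict word -> {verse: tfidf}; V: dict of dicts."""
    total = 0.0
    for i, tpi in word_tfidf[p].items():
        for j, tqj in word_tfidf[q].items():
            total += tpi * tqj * V.get(i, {}).get(j, 0.0)
    return total
```

Because both TF-IDF maps are sparse, only verse pairs where both words actually occur (and share enough roots) contribute to the sum.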
Some of the results from these calculations with high TRR values are presented in Table 2.
Table 2. Some of the results with high TRR values.
Using the matrices obtained in the previous sections, a square matrix was constructed whose rows and columns represent the final words. The K-Means clustering method was applied to this matrix, and the resulting clusters are referred to as "abduction frames." Some of these clusters are presented in Table 3; only a portion of the cluster members are shown for display purposes.
Table 3. Some clusters obtained from the K-Means method.
Figure 4 shows the Davies-Bouldin index values for different numbers of clusters.
Figure 4. The Davies-Bouldin Score for different numbers of clusters
As shown in Figure 4, although the criterion continues to decrease beyond 200 clusters, suggesting apparent improvement, an excessively large number of clusters reduces their interpretability and may lead to overfitting of the semantic clusters. Other clustering methods (agglomerative and divisive hierarchical clustering, as well as DBSCAN) were also tried, but they yielded significantly lower output quality than K-Means.
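The Davies-Bouldin criterion used for choosing the number of clusters can be computed directly from the points and their cluster labels. The study does not specify its implementation (library versions such as scikit-learn's davies_bouldin_score exist); this pure-Python version is a minimal sketch:

```python
import math

def _dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def davies_bouldin(points, labels):
    """Davies-Bouldin index: lower values mean tighter, better-separated
    clusters. points: list of vectors; labels: cluster id per point."""
    clusters = sorted(set(labels))
    centroids, scatter = {}, {}
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        centroids[c] = centroid
        # Mean distance of members to their centroid (within-cluster scatter).
        scatter[c] = sum(_dist(p, centroid) for p in members) / len(members)
    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster.
        worst = max(
            (scatter[i] + scatter[j]) / _dist(centroids[i], centroids[j])
            for j in clusters if j != i
        )
        total += worst
    return total / len(clusters)
```

Sweeping the cluster count and plotting this index is exactly what Figures 4, 5, and 7 report.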
One effective approach for selecting appropriate words in constructing a semantic network is the use of a thesaurus. In this study, the Thesaurus of Qur’anic Concepts (2007) was utilized, which is available digitally through the Noor website (https://noorlib.ir/book/info/5028).
This thesaurus contains approximately 3,000 single-word Qur’anic terms, of which around 1,200 were found in the corpus used in this research (i.e., mentioned in the verses on the Hereafter). All previous steps were then performed separately for these words: words not present in the thesaurus were first removed, and the remaining steps were applied to the surviving words.
Figure 5. Davies-Bouldin Index Chart for thesaurus Word Clustering
Figure 5 displays the Davies-Bouldin index values for various numbers of clusters. Based on the chart, 90 clusters were selected. Some of the resulting clusters are presented in Table 4.
Table 4. Some of the clusters obtained using the K-Means method in the thesaurus section with 90 clusters
When the TF-IDF threshold determined word selection, approximately 4,000 words were used in total; these included meaningless terms, typos, and incorrectly stemmed words. In the thesaurus section, around 1,200 words were selected. Solutions were proposed to address each of these issues.
Problem 1: some words had been incorrectly stemmed. These words fell into two categories: Persian and Arabic.
Solution: The Persian text processing library AIPA (https://aipaa.ir) and the Arabic text processing library Qalsadi (https://pypi.org/project/qalsadi) were utilized. Qalsadi is a Python library used for Arabic language text processing. This library also features Arabic stemming capabilities (Zerrouki 2020).
Figure 6. Root Extraction Process Using AIPA and Qalsadi Tools
After initial normalization, the words were listed and stemmed using AIPA. These Persian-stemmed words were then processed by Qalsadi for Arabic root extraction. Since both tools (Arabic and Persian) make errors, human judgment determined which stemming/root extraction was more accurate, and the resulting file was used for the final root extraction. Of the approximately 17,000 words stemmed with the Hazm library, around 15,000 remained after this step.
Problem 2: Words that were meaningless or erroneous (typos).
Solution: To exclude meaningless or erroneous words, a condition was established. For a word to be included in the final word list, it must belong to at least one of the following categories:
Words not falling into these categories were removed from the corpus, leaving approximately 7,000 words. For these remaining words, TF-IDF vectors, co-occurrence matrices, and TRR scores (based on Arabic roots) were computed, and the matrices were combined according to the priorities described earlier. After forming the final matrix, the words were clustered using the K-Means method.
Figure 7. Davies-Bouldin Index Chart for Word Clustering in the Normalization Section
Figure 7 illustrates the Davies-Bouldin index for various numbers of clusters; based on the chart, 420 clusters were selected. Note that in both the TF-IDF-threshold method of final word selection and the thesaurus-assisted selection, a large number of words fell into a single cluster: approximately 1,300 words in the TF-IDF-threshold approach and around 300 in the thesaurus method. This phenomenon generally stems from two factors:
Due to this issue, a significant portion of the words (about 30%) lacked interpretability and were not included in abduction frames. The problem was less pronounced in the normalization section, where the largest clusters contained around 400 members each, comprising less than 14% of the total words.
Table 5. Some of the clusters obtained using the K-Means method in the normalization section with 420 clusters.
For the external evaluation criterion, 10% of the clusters from each section were randomly selected, on the condition that they contain at least 10 and at most 40 members: 20 clusters for the TF-IDF-threshold condition, 12 for the thesaurus condition, and 42 for the normalization condition. From each selected cluster, 10 members were randomly picked and labeled positive; an additional 10 members were randomly selected from other clusters and labeled negative. Each user was shown a form asking to what degree a word belongs to a group of words, rated on a five-point scale: very high, high, medium, low, or very low. Two scenarios exist for the queried word:
Two questions were asked per cluster, one for a positive word and one for a negative word, so each user received a total of 6 questions. Age and gender were also collected from each participant. To gather data from a variety of individuals, advertisements were posted three times (once per experiment) in a Telegram channel (https://telegram.me/OfficialPersianTwitter) with approximately 500,000 members; the advertisements were placed at meaningful time intervals to balance data collection.
If a respondent selected "very high" or "high," the answer was counted as positive; "low" or "very low" was counted as negative; "medium" responses were excluded due to respondent uncertainty. From these answers, four values were defined for evaluation: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
Five metrics—accuracy, precision, recall, specificity, and F-measure—were computed for evaluation.
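These five metrics follow directly from the four counts; a minimal sketch (the function name is ours):

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Compute the five reported metrics from the four counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)          # of predicted positives, how many correct
    recall = tp / (tp + fn)             # of actual positives, how many found
    specificity = tn / (tn + fp)        # of actual negatives, how many found
    f_measure = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity,
            "f_measure": f_measure}
```

Here a "positive" is a questionnaire answer agreeing that an in-cluster word belongs, and a "negative" is agreement that an out-of-cluster word does not.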
Multiple experiments were conducted at intervals for data collection. Across three experiments, 1,295 volunteers participated in the dynamic online questionnaire, each answering 6 questions, yielding 7,770 stored responses. These questions evaluated a total of 71 clusters.
There is no significant difference between the results of the first and second experiments and the combined results of all three experiments; the outcomes converge. The overall results for the three methods (TF-IDF threshold, thesaurus, and normalization) are presented in Tables 6, 7, and 8.
Table 6. Number of Participating Persons in each Experiment
Table 7. Evaluation Results for the Combined First, Second, and Third Experiments (values in percentages)
Table 8. Overall Results of Accuracy, Precision, Recall, Specificity, and F-Measure.
A separate questionnaire was designed for Qur’anic experts. 12 clusters were randomly selected, and from each cluster, 7 words were chosen to ensure cluster coverage and diversity. Two questions were asked per cluster:
The first option scored 4, and the last scored 1.
Seven experts participated. The average score for the first question was 3.74, and for the second, 2.6.
This study presented and evaluated a technical approach for creating abductive semantic networks for the Holy Qur’an. The results indicate that natural language processing methods can automatically and semi-automatically generate such networks for the Qur’an using interpretive data, base ontologies such as FarsNet, and word co-occurrences in the verses.
Evaluation by Qur’anic experts shows that the outputs exhibit strong accuracy of semantic connections and support the discovery of novel associations. Generating abductive semantic networks for the Qur’an thus holds practical significance for the Qur’anic and Islamic studies communities.
A key outcome of this research is the recommendation that producing abductive semantic networks with various technical methods be treated as a vital and practical topic in future computational Qur’an-mining research, particularly in the application domains of semantics, exegesis, and cross-lingual studies.