Recognition System for Libyan Entity Names
Abdelsalam A. Almarimi and Ezzedin M. Enbiah

Abstract: Named entity recognition (NER) is a computational linguistics task that finds and classifies proper nouns in a text, such as person names, geographical locations, and organizations. It is fundamental to natural language processing. In Libya, many private and public institutions struggle to translate entity names properly from Arabic into English. In this paper, we therefore analyze Arabic articles to extract and recognize entity names. A recognition system is developed for names of persons, academic institutions, and cities in Libya. First, a training corpus and dictionaries are built for the intended entity names. Next, the characteristics of the entity names are studied, and their patterns and rules are designed. The system is then implemented using the Nooj linguistic environment. Person names, Libyan cities, and academic institutions were recognized, and statistics show how frequently each type appears in our training corpus. The obtained results are promising and meet the research goals for tackling the problem of Arabic named entity recognition.


I. INTRODUCTION
Natural language processing (NLP) is the field concerned with enabling computers to understand and process natural language text in order to accomplish useful tasks. Its aims include building practical computational applications that explore the features of human language and developing systems that process written language intelligently. It covers both theory and a variety of applications, and it remains an active area of software development and research. The most common applications of NLP include information retrieval and extraction, machine translation, question answering, data classification and summarization, dialogue applications, and named entity recognition (NER) [1].
NER can be defined as the task of finding named entities in open-domain, unstructured texts, such as newspaper articles, and classifying them into predefined classes or types. It forms part of data classification applications in specific fields such as sport, agriculture, and medicine.
NER is a subtask of information extraction that involves identifying named entities and classifying them into a set of predefined categories of interest. In NER, the expression "named entity" covers not only proper names but also temporal expressions and some numerical expressions, so that time, date, and address expressions can be extracted, for example from letters. Named entities (NEs) occur frequently in Arabic texts, and recognizing them is essential. Recognizing and categorizing NEs requires both internal (morphological) and external (syntactic) evidence [2].
Arabic NER has received increasing attention from researchers in recent years [3]-[6]. In this work, we analyze Arabic texts and phrases to extract entity names and to recognize person names and Libyan cities. The main goal is to provide Libyan society, now and in the future, with a useful service: an information system that recognizes the entity names of interest and translates them properly from Arabic into English. We used the Nooj linguistic environment to develop the proposed system [7].
The remainder of this paper is organized as follows. Section II introduces the related work. Section III presents the architecture of the proposed system. Section IV describes the design of the linguistic grammars. Section V shows the experimental results. Section VI presents the evaluation of the proposed system. Finally, conclusions and future work are presented in Section VII.

II. RELATED WORK
Named entity recognition (NER) has become an active research area within natural language processing. It was first introduced in the 1990s, in particular at the Message Understanding Conferences, as an information extraction task and was deemed important by the research community [8].
A NER system is an important preprocessing tool for tasks such as document classification or clustering, machine translation, information extraction, information retrieval, indexing and search, and other text processing applications. It is also valuable for web mining. Information retrieval applications, for instance, recognize and retrieve the information in a data set that matches an input query [9].
Several studies have used local grammars, implemented as finite-state automata, to recognize person names in textual documents [10], [11].
Reference [12] introduced a rule-based approach for recognizing Arabic person names. Traboulsi [13] discusses the use of corpus linguistics methods and techniques, in conjunction with the local grammar formalism, to identify patterns of person names in Arabic news texts. Shaalan [14] presented a survey of Arabic NER and classification that describes recent activity and growth in Arabic NER research and emphasizes the importance of language-specific features for Arabic NER. Likewise, Abdallah [15] defined a local grammar as a way of describing the syntactic restrictions of certain subsets of sentences.

III. PROPOSED MODEL
The architecture of the proposed system is shown in Fig. 1. It is implemented using the Nooj linguistic environment. The main components of the proposed system are:
1) A Nooj corpus, which stores the Libyan data and articles we have collected. The data are represented in the corpus as a set of annotations. The corpus was extended using a dedicated web crawler that we developed.
2) Lexical resources, comprising dictionaries and morphological grammars. A dictionary of a language contains all of its lemmas (entries), each associated with its possible meanings. Fig. 2 shows a sample of the person-name and city dictionaries.
3) Morphological grammars, which are finite-state automata that read (recognize) the sequence of letters in a word and associate it with linguistic information.
4) Syntactic grammars, which are sets of patterns designed and represented in Nooj's linguistic engine as transducers for recognizing the intended entity names.
5) Concordance tables, which display the results matched by the applied grammar. A concordance table can contain four columns: the first (Text) shows the source file of the result, the second and fourth (Before, After) show the five words before and after the recognized entity, and the third (Seq.) displays the matched result.
6) A relational database management system (RDBMS), here a MySQL server, which stores the results obtained from the concordance tables.
7) A web interface, an interactive screen that lets users search for the translation or transliteration of recognized entity names by querying the MySQL database through PHP code.
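As a rough illustration of components 5 to 7, the following sketch builds concordance rows (source text, five words before, matched sequence, five words after) and stores them in a relational table that can then be queried for a transliteration. It is not the authors' implementation: Python's built-in sqlite3 module stands in for the MySQL server and PHP layer, a plain dictionary lookup stands in for the Nooj grammar, and the table name, column names, and function names are all hypothetical.

```python
import sqlite3

WINDOW = 5  # words of context kept on each side of a match


def concordance_rows(source, text, lexicon):
    """Yield (source, ctx_before, seq, ctx_after, english) for each known token.

    `lexicon` maps an Arabic token to an English form; it is a toy stand-in
    for the dictionary and grammar lookup, not the Nooj engine.
    """
    tokens = text.split()
    for i, tok in enumerate(tokens):
        if tok in lexicon:
            ctx_before = " ".join(tokens[max(0, i - WINDOW):i])
            ctx_after = " ".join(tokens[i + 1:i + 1 + WINDOW])
            yield (source, ctx_before, tok, ctx_after, lexicon[tok])


def store(conn, rows):
    """Create the (hypothetical) concordance table if needed and insert rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS concordance ("
        "source TEXT, ctx_before TEXT, seq TEXT, ctx_after TEXT, english TEXT)"
    )
    conn.executemany("INSERT INTO concordance VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


def lookup(conn, seq):
    """Return the distinct English forms stored for a recognized sequence."""
    cur = conn.execute(
        "SELECT DISTINCT english FROM concordance WHERE seq = ?", (seq,)
    )
    return [row[0] for row in cur.fetchall()]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    lexicon = {"طرابلس": "Tripoli", "بنغازي": "Benghazi"}
    rows = concordance_rows("sample.xml", "وصل الوفد إلى طرابلس اليوم", lexicon)
    store(conn, rows)
    print(lookup(conn, "طرابلس"))
```

The parameterized query in `lookup` plays the role that the web interface's database query plays in the described architecture; a production system would add indexing and handle multi-word entities.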

IV. DESIGNING THE LINGUISTIC GRAMMARS
The intended entity names are formalized as patterns. Three basic pattern groups are designed: one for person names, one for cities, and one for institutions. These patterns are represented as grammars in Nooj. Fig. 3 shows the recognition grammar designed for person names. This grammar is a finite-state graph consisting of a set of nodes: it starts with an initial state, continues through transition states, and ends with a final state. A syntactic grammar reads the text word by word, as a sequence of tokens. For example, <N+FirstName> is a transition state representing the rule that the token is a noun and a first name matching an entry in the person-name dictionary.
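The idea of a finite-state grammar that reads tokens and checks them against dictionary tags can be sketched in a few lines of Python. The sketch assumes a deliberately simplified pattern, one or more consecutive <N+FirstName> tokens, and a tiny hypothetical dictionary; it does not reproduce the actual graph of Fig. 3 or Nooj's matching engine.

```python
# Hypothetical mini-dictionary: token -> set of tags.
# This is a stand-in for a Nooj dictionary, not its real format.
DICTIONARY = {
    "محمد": {"N", "FirstName"},
    "علي": {"N", "FirstName"},
    "أحمد": {"N", "FirstName"},
}


def has_tags(token, *tags):
    """True if the dictionary entry for `token` carries all the given tags."""
    return set(tags) <= DICTIONARY.get(token, set())


def recognize_person_names(tokens):
    """Return (start, end) token spans, end exclusive, matching <N+FirstName>+.

    The outer loop is the initial state; the inner loop is the name state,
    which is left (final state) at the first token that is not a first name.
    """
    spans = []
    i = 0
    while i < len(tokens):
        if has_tags(tokens[i], "N", "FirstName"):
            j = i + 1
            while j < len(tokens) and has_tags(tokens[j], "N", "FirstName"):
                j += 1
            spans.append((i, j))
            i = j
        else:
            i += 1
    return spans
```

For the token sequence `"قال محمد علي إن".split()`, the two middle tokens are dictionary first names, so the function returns the single span covering them. Taking the longest run of name tokens is a design choice of this sketch, made so that multi-part names are reported once.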

A. Designing the Grammar of Person Names
To improve the performance of the local grammars, we added a new local grammar as a trigger (TriggerParent) to capture kinship relations between persons, such as "son of" (بن or ابن). We also added another local grammar as a trigger (TriggerPerson) to capture titles that accompany person names, such as Eng., Dr., Prof., Mr., and Mrs.
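A minimal sketch of how such triggers might extend a name pattern is given here. It assumes the simplified pattern "optional title, first name, then further names optionally linked by a parent trigger". The first-name set and the Arabic title forms are illustrative assumptions (the text lists the titles only by their English abbreviations), and the function is hypothetical rather than a rendering of the Nooj graphs.

```python
FIRST_NAMES = {"محمد", "علي", "أحمد"}  # hypothetical dictionary entries
TRIGGER_PARENT = {"بن", "ابن"}  # "son of", as given in the text
# Assumed Arabic forms for Dr., Eng., Prof., Mr., Mrs.
TRIGGER_PERSON = {"الدكتور", "المهندس", "الأستاذ", "السيد", "السيدة"}


def match_person(tokens, start):
    """Match [title] name ((parent-trigger)? name)* beginning at `start`.

    Returns (end, has_title) with `end` exclusive, or None if no name starts
    at `start`. A parent trigger is consumed only when a name follows it.
    """
    i = start
    has_title = i < len(tokens) and tokens[i] in TRIGGER_PERSON
    if has_title:
        i += 1
    if i >= len(tokens) or tokens[i] not in FIRST_NAMES:
        return None
    i += 1
    while i < len(tokens):
        if tokens[i] in FIRST_NAMES:
            i += 1
        elif (
            tokens[i] in TRIGGER_PARENT
            and i + 1 < len(tokens)
            and tokens[i + 1] in FIRST_NAMES
        ):
            i += 2
        else:
            break
    return i, has_title
```

Requiring a name after the parent trigger keeps a dangling بن at the end of a sentence from being absorbed into the match.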

B. Designing the Grammar of Cities
Recognizing city names is based on the designed linguistic grammar and dictionary.

C. Designing the Grammar of Academic Institutions
Recognizing academic institutions is based on formal linguistic patterns designed for the academic institutions in Libya. The academic institutions are classified into three patterns: universities, faculties, and higher institutes. Faculty names, for example, are in turn classified into three patterns. The local grammars for the designed patterns are built as shown in Fig. 5, which represents the main graph (transducer). We classified institutions into University, Faculty, Higher Institute, and Research Center. Fig. 6 shows a sample of the academic institution recognition grammar.
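The specific institution patterns are not reproduced in this text. Purely to illustrate what a keyword-headed pattern could look like, the sketch below assumes that an institution name begins with a head word such as جامعة (university), كلية (faculty), المعهد العالي (higher institute), or مركز بحوث (research center), followed by at least one further token. The head words, labels, and function name are assumptions made for this example and should not be read as the patterns of Fig. 5 or Fig. 6.

```python
# Hypothetical head keywords mapped to the four classes named in the text.
# Multi-word heads are listed first so that they are tried before shorter ones.
HEADS = [
    (("المعهد", "العالي"), "HigherInstitute"),
    (("مركز", "بحوث"), "ResearchCenter"),
    (("جامعة",), "University"),
    (("كلية",), "Faculty"),
]


def classify_institution(tokens):
    """Return a class label if `tokens` starts with a known head, else None.

    At least one token must follow the head, so a bare keyword such as
    "جامعة" on its own is not treated as an institution name.
    """
    for head, label in HEADS:
        n = len(head)
        if tuple(tokens[:n]) == head and len(tokens) > n:
            return label
    return None
```

A call such as `classify_institution("جامعة طرابلس".split())` yields the University label under these assumptions. A fuller grammar would also need to decide where the name ends, which this sketch leaves out.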

V. EXPERIMENTAL RESULTS
The recognition experiments were carried out by applying the designed grammars and dictionaries to the created corpus. The corpus was obtained by crawling Libyan news articles that are available online. The content of three months of articles was stored as XML files and added to the corpus; the link 'www.libyaakhbar.com/Libya-news/' was used as the seed link for the crawler.

A. Recognition of Person Names
Fig. 7 shows a sample of the recognized person names. To give more detail about each result, we added the feature <ENAMEX+PERS+Title>, as shown in Fig. 8. It indicates that the recognized entity is a person name together with the person's title, if one exists (TriggerParent or TriggerPerson).
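The corpus-building step described at the start of this section, in which crawled articles are stored as XML files, can be pictured with a short sketch using Python's standard xml.etree.ElementTree module. The element and attribute names (`article`, `title`, `body`, `url`, `date`) are hypothetical, no network access is performed, and the crawler's actual file layout is not known from the text.

```python
import xml.etree.ElementTree as ET


def article_to_xml(url, date, title, body):
    """Serialize one crawled article to an XML string (hypothetical schema)."""
    root = ET.Element("article", attrib={"url": url, "date": date})
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "body").text = body
    # encoding="unicode" returns a str and keeps Arabic characters readable.
    return ET.tostring(root, encoding="unicode")


def xml_to_text(xml_string):
    """Recover the article body that would be added to the corpus."""
    root = ET.fromstring(xml_string)
    return root.findtext("body")
```

Keeping the source URL and date as attributes would let each recognized entity be traced back to its article, which is the role the Text column plays in a concordance table.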

B. Recognition of Libyan Cities
Libyan cities are recognized by applying the designed grammars and the cities' Nooj dictionary to the created corpus. Fig. 9 shows a sample of the recognized Libyan city names.

C. Recognition of Libyan Academic Institutions
Libyan academic institutions are recognized by applying the designed grammars and the organizations' Nooj dictionary to the created corpus. Fig. 10 shows a sample of the recognized results.

VI. SYSTEM EVALUATION
To evaluate the performance of our system, we computed the rate at which person names, academic institutions, and cities appear in our training corpus. Figs. 11, 12, and 13 show these rates, respectively.
Fig. 11. The appearance rate of person names.
Precision was computed to measure the accuracy of the obtained results by counting the true results, that is, the correct recognitions. Table I shows the correct recognitions for the sample of person names and Libyan cities used to test our system. The recall and F-measure values obtained in the evaluation are shown in Table II. The lower precision for academic institutions arises because some institutions are not Libyan.
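For reference, the standard definitions of the three measures can be written as a small Python function. The counts in the usage note are hypothetical and are not the values of Table I or Table II; the function name and the convention of returning 0.0 for an empty denominator are choices made for this sketch.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F-measure from raw counts.

    precision = TP / (TP + FP): share of recognized entities that are correct.
    recall    = TP / (TP + FN): share of actual entities that were recognized.
    F-measure = harmonic mean of precision and recall.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

With, say, 8 correct recognitions, 2 spurious ones, and 2 missed entities, precision and recall are both 8/10, and so is their harmonic mean. A non-Libyan institution that the grammar accepts would count as a false positive here, which is how it lowers precision.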

VII. CONCLUSION
In this paper, we introduced a system for recognizing Arabic person names, Libyan cities, and academic institutions from electronic news texts. The contribution of this paper is an approach for recognizing Libyan entity names. We used the Nooj linguistic environment to design and implement our recognition model. In our experiments, we found that person-name recognition may produce wrong results when names coincide with ordinary words such as adjectives, for example "Saeed" (سعيد) and "Intisar" (إنتصار). City recognition is very accurate because our dictionary contains all Libyan cities. Ambiguity can still arise: "Bishir city" (مدينة بشر) can denote either the city of Bishir or "human" (بَشَر), and "Nisma city" (نِسْمَة) can also denote "people" (نَسَمَة, "Nasama"). One significant factor that has influenced these results is the lack of standardization in written Arabic: most texts are loaded with variant forms of Arabic script. Standardized writing would support more accurate results. The developed system was evaluated on a testing corpus, and the obtained results were very promising. In future work, we plan to:
1) Extend our corpus to cover more of the intended names, using our crawler to extract texts from daily Libyan magazines, store them as XML files, and add them to the corpus.
2) Extend the designed grammars so that they can be used to translate the intended entity names from Arabic into English.
3) Develop a user-friendly GUI for the proposed system.