Decoding NLP: From Human Language to Tech Advancements

Natural Language Processing (NLP) is a field at the intersection of Artificial Intelligence (AI) and Linguistics1 that focuses on enabling computers to understand and generate human language.2 NLP emerged in the 1950s to simplify tasks for users and to fulfil the long-standing desire to interact with computers using natural language.3

Overview of NLP

NLP employs either rule-based or machine learning methods to comprehend the structure and meaning of text.4 It powers chatbots, voice assistants, text scanning programs, translation applications, and enterprise software that enhance business operations, boost productivity, and streamline various processes.5 The best-known NLP tool is GPT-3 from OpenAI, which uses artificial intelligence and statistical methods to predict the next word in a sentence based on the words that precede it.6 NLP practitioners refer to tools like this as “language models”.7 These models can be used for basic analytics tasks, such as document classification and sentiment analysis, as well as more advanced tasks like question answering and summarization.8
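
To make the “predict the next word” idea concrete, here is a minimal sketch in Python that uses simple bigram counts rather than the neural networks behind models like GPT-3; the corpus and function names are illustrative only:

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count, for each word, which words follow it in the corpus."""
    following = defaultdict(Counter)
    words = corpus.lower().split()
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

corpus = ("the cat sat on the mat . the cat chased a mouse . "
          "the cat slept .")
model = train_bigram_model(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" most often here
```

Real language models condition on far more context than a single preceding word, but the underlying task, scoring candidate continuations and choosing a likely one, is the same.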

Some common benefits of NLP include increasing productivity by automating tasks, thereby reducing costs and enhancing overall efficiency for businesses.9 By leveraging NLP, businesses can develop innovative products and services, such as customer service chatbots, language translation applications, and question-answering systems, that transform customer interactions and offer new, cutting-edge solutions.10 Additionally, NLP can enhance decision-making by analyzing large volumes of text data, providing insights that support informed decisions.11

To provide a deeper insight into NLP, this blog will explore its key components, their functionalities, and the challenges encountered in the development of NLP tools.

Exploring NLP Components and their Functionality

NLP has two primary components, namely Natural Language Generation (NLG) and Natural Language Understanding (NLU).12

Natural Language Generation (NLG) involves automatically producing human-readable text or speech from structured data or inputs.13 NLG systems analyse input data to grasp its meaning and generate coherent and contextually appropriate language output.14 This involves utilizing various techniques, from rule-based systems to advanced deep learning models, to convert structured information into natural language.15 In order to formulate responses appropriately, NLG applications must take into account language rules concerning morphology, lexicons, syntax and semantics.16
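
As a toy illustration of rule-based NLG, the sketch below turns a hypothetical structured weather record into a sentence; the field names and wording rules are invented for this example and are far simpler than production NLG systems:

```python
def weather_report(data):
    """Turn a structured weather record into a natural-language sentence."""
    # Simple rule: choose the closing advice based on the rain probability.
    outlook = "bring an umbrella" if data["rain_chance"] > 50 else "expect dry weather"
    return (f"In {data['city']}, temperatures will reach "
            f"{data['high_c']}°C with a {data['rain_chance']}% chance of rain; "
            f"{outlook}.")

record = {"city": "Nairobi", "high_c": 24, "rain_chance": 70}
print(weather_report(record))
# In Nairobi, temperatures will reach 24°C with a 70% chance of rain; bring an umbrella.
```

Deep learning NLG models replace these hand-written templates with text generated token by token, but the input/output contract, structured data in, fluent language out, is the same.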

Natural Language Understanding (NLU), on the other hand, enables machines to comprehend and interpret human language by extracting metadata from content.17 It focuses on grasping the meaning of a sentence through syntactic and semantic analysis of the text.18 Human languages are complex, with intricate, nuanced, and ever-evolving meanings, which makes them significantly challenging for computers and other devices to grasp fully.19 Nevertheless, NLU systems enable organizations to develop products and tools that go beyond interpreting words, delving into the underlying meanings they convey.20 Notably, NLU facilitates dialogue with computers using human-based language.21
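
A crude way to see what extracting meaning can look like in code is keyword-based intent detection, a deliberately simplified stand-in for real NLU systems, which rely on full syntactic and semantic analysis; the intents and keyword sets below are hypothetical:

```python
INTENT_KEYWORDS = {
    "book_flight": {"fly", "flight", "ticket"},
    "check_weather": {"weather", "rain", "forecast"},
    "greeting": {"hello", "hi", "hey"},
}

def detect_intent(utterance):
    """Score each intent by keyword overlap and return the best match."""
    words = set(utterance.lower().replace("?", "").replace(".", "").split())
    scores = {intent: len(words & kw) for intent, kw in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(detect_intent("Will it rain tomorrow?"))      # check_weather
print(detect_intent("I need a flight to Mombasa"))  # book_flight
```

This toy matcher fails on paraphrases and ambiguity ("book me something to escape the rain"), which is exactly why modern NLU relies on learned representations rather than keyword lists.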

In light of this, NLP models function by establishing correlations among the basic components of language such as letters, words, and sentences present in a text dataset.22 Thus, NLP architectures employ various techniques for data preprocessing, feature extraction, and model construction.23 In essence, NLP involves two main phases: data preprocessing and algorithm development.24 Data preprocessing entails cleaning and preparing text data to make it analysable by machines.25 This process organizes data into a usable format and emphasizes features in the text that an algorithm can utilize.26

Various techniques may be employed in data preprocessing, including tokenization, stop word removal, lemmatization and stemming, and sentence segmentation.27

  • Tokenization: splits a stream of text into smaller units called tokens, typically words or subwords, giving subsequent processing steps discrete pieces to work with.28

  • Stop word removal: eliminates common words from the text, allowing unique and informative words to stand out.

  • Sentence segmentation: divides a large block of text into meaningful segments that typically correspond to individual sentences.29

  • Lemmatization and stemming: standardizes the various inflected forms of words in a text dataset to a single base form or dictionary entry, referred to as a ‘lemma’ in computational linguistics.30 This technique is valuable in information retrieval systems like search engines, where users enter a single-word query such as ‘meditate’ but anticipate results that include any inflected forms of the word like ‘meditates’ or ‘meditation.’31
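
The four preprocessing steps above can be sketched in a few lines of Python; the stop word list and suffix-stripping rules below are deliberately tiny stand-ins for what libraries such as NLTK or spaCy provide:

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "and", "to", "of"}  # tiny illustrative set

def segment_sentences(text):
    """Sentence segmentation: split on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Tokenization: split a sentence into lower-cased word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())

def remove_stop_words(tokens):
    """Stop word removal: drop common, low-information words."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Very crude suffix stripping; real systems use Porter stemming
    or dictionary-based lemmatization instead."""
    for suffix in ("ion", "ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "The user meditates daily. Meditation is calming."
for sentence in segment_sentences(text):
    tokens = remove_stop_words(tokenize(sentence))
    print([stem(t) for t in tokens])
```

This echoes the search-engine example above: with these rules, “meditates” and “meditation” both reduce to the stem “meditat”, so a query for one form can match the other.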

Once the data has been preprocessed, an algorithm is developed to process it, employing various NLP algorithms.32 Two main types are commonly used: rule-based systems, which rely on meticulously crafted linguistic rules and have been employed since the early stages of NLP development (and remain in use today), and machine learning-based systems.33 Machine learning algorithms utilize statistical methods to learn tasks from training data, adjusting their approaches as they process more data.34 These algorithms leverage machine learning, deep learning, and neural networks to refine their rules through continuous processing and learning.35
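
The contrast between the two approaches can be illustrated with a toy sentiment classifier: the rule-based version uses hand-crafted word lists, while the “learned” version derives word weights from labeled examples; the lexicons and training data here are invented for this sketch:

```python
import re
from collections import Counter

# --- Rule-based: hand-crafted sentiment lexicons ---
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def words(text):
    return re.findall(r"[a-z]+", text.lower())

def rule_based_sentiment(text):
    """Count hits against the fixed word lists."""
    ws = words(text)
    score = sum(w in POSITIVE for w in ws) - sum(w in NEGATIVE for w in ws)
    return "positive" if score >= 0 else "negative"

# --- Machine-learning-style: weights learned from labeled data ---
def train(examples):
    """Learn per-word sentiment weights from (text, label) pairs."""
    weights = Counter()
    for text, label in examples:
        sign = 1 if label == "positive" else -1
        for w in words(text):
            weights[w] += sign
    return weights

def learned_sentiment(weights, text):
    score = sum(weights[w] for w in words(text))
    return "positive" if score >= 0 else "negative"

training_data = [
    ("great service and great food", "positive"),
    ("the food was terrible", "negative"),
    ("I love this place", "positive"),
    ("poor service, never again", "negative"),
]
weights = train(training_data)
print(rule_based_sentiment("the food was great"))   # positive
print(learned_sentiment(weights, "terrible food"))  # negative
```

The key difference is visible in the code: the rule-based lexicons never change, whereas the learned weights shift automatically as more labeled examples are added, which is the adaptability the paragraph above describes.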

Challenges facing NLP Tool Development

Over the years, several NLP tools have been developed, the most famous being chatbots and language models.36 These include Eliza, which was developed in the mid-1960s; Tay, a chatbot by Microsoft launched in 2016; BERT, Generative Pre-Trained Transformer 3 (GPT-3), Language Model for Dialogue Applications (LaMDA) and Mixture of Experts (MoE).37

The challenges in NLP can be categorised into technical and linguistic.38 From a technical perspective, NLP relies on a machine’s statistical and learning capabilities, with corpus data39 being the primary material for training machine learning models in language processing.40 However, developing a robust corpus demands significant manual effort and time, often rendering many corpora inaccessible, thereby impeding progress in machine learning for language acquisition.41

Secondly, in daily life, many conversations and texts are initiated based on shared knowledge between participants. However, current NLP models, trained on large corpora, often lack awareness of historical or cultural contexts, as well as common sense.42 This deficiency in background knowledge complicates machines’ ability to grasp crucial factors during the processing of conversations and texts.43

From a linguistic perspective, machines encounter several challenges in comprehending natural language. These challenges encompass the diversity of languages spoken across various countries and regions, making it difficult for machines to adopt a uniform approach for processing them.44 Additionally, the ambiguity inherent in language means that the same sentence or word in a conversation or text can carry multiple meanings, further complicating machine understanding.45 Machines also need to exhibit consistent performance in processing different accents and styles of conversation, which adds to the robustness requirement of language processing.46

Further, language comprehension heavily relies on practical, real-world knowledge, a challenge that cuts across both the technical and linguistic dimensions.47 Also, machines must grasp language within its specific environment and context, highlighting the crucial role of contextual understanding in effective language processing.48 For instance, linguistic diversity among African languages presents a major challenge for NLP, especially in regions where there is limited familiarity with indigenous datasets.49 The use of unrelated linguistic datasets may not accurately capture the nuances and complexities of African languages.50

African languages, despite their rich linguistic diversity, remain significantly under-represented in NLP research and technology.51 Most existing NLP tools and datasets are focused on high-resource languages, leaving a notable gap in resources and tools for African languages.52 This under-representation is further complicated by the scarcity of machine-readable data,53 historical impact of colonialism that has often sidelined local languages,54 and technical challenges arising from the complexity and ambiguity of African languages.55 Various initiatives and communities have emerged to tackle these challenges,56 including Masakhane Research Foundation,57 KenCorpus58 and Ghana NLP,59 which focus on creating datasets, developing tools, and fostering collaboration to advance NLP for African languages.

To address the challenges in NLP tool development, human oversight is essential in training models to rectify errors and provide nuanced understanding that current Artificial Intelligence (AI) models may overlook.60 In cases where there is a scarcity of high-quality data, various techniques can be utilised to enhance datasets. These techniques include synthetic data generation, where new data is generated based on existing patterns, and data augmentation, which involves making slight alterations to existing data to generate new examples.61
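
A minimal sketch of data augmentation by synonym replacement is shown below; the synonym table is a hypothetical stand-in for resources like WordNet, and a seeded random generator keeps the output repeatable:

```python
import random

# Hypothetical synonym table; real systems often draw on WordNet or embeddings.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "pleased"],
    "big": ["large", "huge"],
}

def augment_by_synonyms(sentence, rng):
    """Data augmentation: replace known words with a randomly chosen synonym."""
    out = []
    for word in sentence.split():
        choices = SYNONYMS.get(word.lower())
        out.append(rng.choice(choices) if choices else word)
    return " ".join(out)

rng = random.Random(0)  # seeded for repeatability
original = "the quick dog looks happy"
for _ in range(3):
    print(augment_by_synonyms(original, rng))
```

Each variant keeps the original meaning while changing the surface form, which is exactly what augmentation aims for: more training examples without collecting new data.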

Conclusion

NLP significantly enhances our interaction with technology by increasing productivity, enabling innovative products, and improving customer experiences. By employing components such as NLG and NLU, NLP systems can generate coherent language outputs and comprehend human language’s intricate meanings. Although several challenges are yet to be overcome, current models have already had a transformative impact, enabling more user-friendly virtual assistants and enhanced human-computer interactions, and pointing to a future where communication with machines is as seamless as with humans. Considering these advancements, one question worth asking is: what will be the impact on intellectual property rights and licensing in NLP tool development?

Acknowledgements

This blog draws inspiration from the insightful ideas of Dr. Melissa Omino.

Image by pikisuperstar on Freepik

1 Prakash M Nadkarni, Lucila Ohno-Machado and Wendy W Chapman, Natural Language Processing: An Introduction <https://www.researchgate.net/publication/51576224_Natural_language_processing_An_introduction> accessed 12 July 2024

2 Diksha Khurana and others, Natural Language Processing: State of The Art, Current Trends and Challenges <https://www.igntu.ac.in/eContent/IGNTU-eContent-803947345413-MA-Linguistics-4-HarjitSingh-ComputationalLinguistics-2.pdf> accessed 12 July 2024

3 Khurana (n 2)

5 ibid

6 Ross Gruetzmacher, The Power of Natural Language Processing (19 April 2022) <https://hbr.org/2022/04/the-power-of-natural-language-processing> accessed 12 July 2024

7 ibid

8 ibid

9 Orkun Orulluoglu, Natural Language Processing: Current Uses, Benefits and Basic Algorithms (12 August 2023) <https://medium.com/@bayramorkunor/natural-language-processing-current-uses-benefits-and-basic-algorithms-963fffa722a7> accessed 12 July 2024

10 ibid

11 ibid

12 Turing, How does Natural Language Processing Function in AI? <https://www.turing.com/kb/natural-language-processing-function-in-ai> accessed 15 July 2024

13 Diwakar R. Tripathi and Abha Tamrakar, Natural Language Generation: Algorithms and Applications <https://www.researchgate.net/publication/380340899_Natural_Language_Generation_Algorithms_and_Applications> accessed 15 July 2024

14 ibid

15 ibid

16 Eda Kavlakoglu, NLP vs. NLU vs. NLG: the differences between three natural language processing concepts <https://www.ibm.com/blog/nlp-vs-nlu-vs-nlg-the-differences-between-three-natural-language-processing-concepts/> accessed 15 July 2024

17 Turing (n 12)

18 Dhruv Gupta, NLU v NLG: Unveiling the two sides of Natural Language Processing (2 April 2024) <https://medium.com/@researchgraph/natural-language-processing-f58aeeb908df> accessed 15 July 2024

19 ibid

20 ibid

21 Alexander S. Gillis, Natural Language Understanding (NLU) <https://www.techtarget.com/searchenterpriseai/definition/natural-language-understanding-NLU> accessed 15 July 2024

22 DeepLearning.AI, A Complete Guide to Natural Language Processing (11 January 2023) <https://www.deeplearning.ai/resources/natural-language-processing/> accessed 15 July 2024

23 ibid

24 Gillis (n 21)

25 ibid

26 ibid

27 DeepLearning.AI (n 22)

28 Gillis (n 21)

29 ibid

30 Jacob Murel and Eda Kavlakoglu, What are stemming and lemmatization (10 December 2023) <https://www.ibm.com/topics/stemming-lemmatization#f01> accessed 15 July 2024

31 ibid

32 Gillis (n 21)

33 ibid

34 ibid

35 ibid

36 DeepLearning.AI (n 22)

37 ibid

38 Dingli Chen, ‘Challenges of Natural Language Processing from a Linguistic Perspective’ (2024) 13(2) International Journal of Education and Humanities 217-219

39 Corpus data refers to texts or speech used for linguistic analysis, varying from small samples to extensive databases of texts or recordings. It is crucial for NLP since it provides a large and structured set of data for training and testing language models (See https://waywithwords.net/landing/the-value-of-corpus-data-in-nlp-and-srt/)

40 Chen (n 38)

41 ibid

42 ibid

43 ibid

44 ibid

45 ibid

46 ibid

47 ibid

48 ibid

49 Nelson Ndugu and Rashmi Margani, From Local to Global: Navigating Linguistic Diversity in the African Context <https://arxiv.org/pdf/2305.01427> accessed 16 July 2024

50 ibid

51 B Wanjawa and others, Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks (8 July 2023) <https://arxiv.org/abs/2208.12081>

52 Chesire Emmanuel and Kipkebut Andrew, Current State, Challenges and Opportunities for Natural Language Processing Research and Development in Africa: A Systematic Review (ICLR 2024 Workshop AfricaNLP, 3 March 2024) <https://openreview.net/forum?id=9CsL0PvDDV> accessed 16 September 2024

53 C Tao, Supporting Natural Language Processing (NLP) in Africa (16 August 2022) <https://blog.google/intl/en-africa/company-news/technology/supporting-natural-language-processing/>

54 Chijioke Okorie and Vukosi Marivate, ‘How African NLP experts are navigating the challenges of copyright, innovation, and access’ (April 30, 2024), Carnegie Endowment for International Peace. <https://carnegieendowment.org/research/2024/04/how-african-nlp-experts-are-navigating-the-challenges-of-copyright-innovation-and-access?lang=en> accessed 16 September 2024

55 Ife Adebara and Muhammad Abdul-Mageed, ‘Towards Afrocentric NLP for African Languages: Where We Are and Where We Can Go’ in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (ACL 2022) 3814–3841

56 Okorie (n 54)

57 Masakhane, A grassroots NLP community for Africa, by Africans <https://www.masakhane.io/>

58 ‘Kencorpus: Kenyan Languages Corpus’ Harvard Dataverse <https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6N5V1K>

59 Ghana Natural Language Processing <https://ghananlp.org/>

60 Redress Compliance, Challenges in NLP and Overcoming Them (26 March 2024) <https://redresscompliance.com/challenges-in-nlp-and-overcoming-them/> accessed 16 July 2024

61 ibid
