Open-source generative AI holds promise for universities in developing countries
Generative AI has become a widely used term thanks to ChatGPT, which is based on a Large Language Model (LLM) called the Generative Pre-trained Transformer (GPT). In general, the use of LLMs in education can be transformative. For instance, they can power intelligent tutoring systems capable of providing personalised learning experiences: such systems can answer students' questions, provide explanations and even generate practice problems. LLMs can also be used to translate open educational resources (OER) into local languages, making education more accessible.
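To make the translation use case concrete, the sketch below uses the open-source Hugging Face transformers library with the openly licensed facebook/nllb-200-distilled-600M checkpoint, a compact open translation model chosen here purely so the example runs on modest hardware; a larger multilingual LLM could be prompted to the same end. This is a minimal sketch under those assumptions, not a recommended toolchain.

```python
# Sketch: translating an OER sentence into a local language with an open model.
# Assumes `pip install transformers torch` and the openly licensed
# facebook/nllb-200-distilled-600M checkpoint (chosen for illustration only).
from transformers import pipeline

translator = pipeline("translation", model="facebook/nllb-200-distilled-600M")

oer_text = "Photosynthesis is the process by which plants convert sunlight into energy."
result = translator(
    oer_text,
    src_lang="eng_Latn",  # source language: English
    tgt_lang="swh_Latn",  # target language: Swahili (any supported code works)
    max_length=100,
)
print(result[0]["translation_text"])
```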
Educators have noted challenges when LLMs are used extensively in teaching. Concerns include the accuracy of outputs, implicit and explicit biases, and the cultural appropriateness of generated content. While some explicit biases can be addressed, there is as yet no clear way to remove implicit ones.
A major concern is privacy. It has been shown that an adversary can extract or reconstruct verbatim training samples from an LLM, which can lead to the revelation of personally identifiable information. Ethical concerns in AI also include how the training data for an LLM was acquired; this should matter to educators who use these models.
Recent GPT models are commercial, closed products and cannot be freely repurposed. Open-source LLMs have emerged as promising alternatives, especially for developing countries. These models, pre-trained on vast amounts of data, can be fine-tuned to perform various tasks, from language translation to answering complex questions, making them a versatile asset in the educational sector. Like open-source software, they can offer a wide range of services in education at a lower cost of ownership. They can be downloaded and hosted locally, and some can be run on consumer-grade computers within an institutional network.
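As a minimal sketch of local hosting, the following code assumes the small, openly licensed bigscience/bloom-560m checkpoint (a 560-million-parameter member of the BLOOM family, chosen because it fits in the memory of an ordinary laptop). It downloads the model once and then answers a student-style question entirely on the local machine, with no external API calls.

```python
# Sketch: hosting a small open LLM locally, with no calls to external services.
# Assumes `pip install transformers torch`; bigscience/bloom-560m is a small
# openly licensed checkpoint that runs on a consumer-grade CPU or GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,  # use float16 on a GPU to halve memory use
)

prompt = "Explain Newton's first law of motion in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=80, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```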
Several open-source LLMs are available today, and the number is growing. BLOOM is the largest LLM in the open-source domain, with about 176 billion parameters; it can generate outputs in 46 human languages and 13 programming languages. Four models in Meta's Large Language Model Meta AI (LLaMA) family have been made available to the public, the largest having 65 billion parameters, pre-trained on high-quality data. Its successor, LLaMA 2, has been released under a licence that permits both research and commercial use.
Pre-training LLMs is a resource-intensive process, often requiring significant computational power and financial investment. However, once such models are in the open domain, third parties, researchers and practitioners can fine-tune them using their institutional or private data to accomplish their own AI tasks. This approach reduces the cost of leveraging LLMs, making them accessible to institutions with limited resources, such as universities in developing countries. Fine-tuning can also help address some of the concerns about accuracy and explicit bias.
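One common low-cost route is parameter-efficient fine-tuning, which trains a small set of adapter weights instead of the full model. The sketch below uses the Hugging Face peft library with the small bigscience/bloom-560m checkpoint and a hypothetical institutional text file, institutional_qa.txt; it illustrates the general shape of such a setup under those assumptions, not a production recipe.

```python
# Sketch: parameter-efficient (LoRA) fine-tuning of a small open LLM on
# institutional data. Assumes `pip install transformers peft datasets torch`.
# The file "institutional_qa.txt" (one training example per line) is hypothetical.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Attach LoRA adapters: only a small fraction of the weights will be trained.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()

dataset = load_dataset("text", data_files={"train": "institutional_qa.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bloom-560m-institutional",
        num_train_epochs=1,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Because only the adapter weights are updated, a run like this fits on a single modest GPU, which is what makes institutional fine-tuning affordable.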
Among the newer open-source LLMs, Vicuna 13B is gaining popularity. It is a fine-tuned version of LLaMA, and its developers claim that it achieves over 90 per cent of the quality of ChatGPT and Google Bard in evaluations judged by GPT-4. The cost of fine-tuning Vicuna 13B was about USD 300.
A significant development is the release of Falcon 40B by the Technology Innovation Institute in the United Arab Emirates. It is a model pre-trained on about one trillion tokens (roughly 750 billion words) of high-quality data. As a foundational LLM, it can be fine-tuned for any task. A smaller version with seven billion parameters is also available and can be run at a reasonable cost. Falcon 40B is an example of how a nationally co-ordinated effort can invest in creating its own high-quality LLM for unrestricted use.
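By way of illustration, the instruction-tuned seven-billion-parameter variant, published as tiiuae/falcon-7b-instruct on the Hugging Face Hub, can be run with a few lines of code. The sketch below follows the pattern shown in the model card and assumes a single GPU with roughly 16 GB of memory; it is an illustration, not a tested deployment.

```python
# Sketch: running the 7B Falcon model locally. Assumes a GPU with ~16 GB of
# memory and `pip install transformers accelerate torch`.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b-instruct",
    torch_dtype=torch.bfloat16,  # halves memory relative to float32
    device_map="auto",           # place layers on the available GPU(s)
    trust_remote_code=True,      # Falcon shipped with custom model code
)

answer = generator(
    "Suggest three practice questions on Ohm's law for first-year students.",
    max_new_tokens=120,
    do_sample=True,
    top_k=10,
)
print(answer[0]["generated_text"])
```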
The Russell Group of universities in the UK recently published a statement of principles for using generative AI in education. The principles include promoting AI literacy among students and staff, equipping staff to support students in using generative AI tools, adapting teaching and assessment to incorporate the ethical use of AI, upholding academic rigour and integrity, and sharing best practice.
Open-source LLMs can help universities adhere to these principles and to the ethics of AI. They can also be hosted offline within universities, which can fine-tune them to perform relevant AI tasks.