Tokyo, Japan – July 10th, 2024 – Morpho AI Solutions, Inc. (hereinafter “Morpho AIS”), which is responsible for the commercialization of AI within the Morpho Group, has been supplying an AI-OCR output service since 2023 for use in generating training data for Japanese LLMs.
It is proud to report that it has recently been contracted by the Research Organization of Information and Systems and the National Institute of Informatics to develop AI-OCR for Japanese academic papers. Through this project, Morpho AIS will contribute to the development of a domestic LLM with powerful Japanese capabilities, an effort being spearheaded by the National Institute of Informatics.
On April 1, 2024, the National Institute of Informatics established a new Research and Development Center for Large Language Models (*1) to research and develop new LLMs. This center will serve as one of the sites in the Ministry of Education, Culture, Sports, Science and Technology’s project to establish research and development sites to ensure greater transparency and reliability of generative AI models. This LLM R&D center is preparing a corpus, building a computation environment, creating evaluation benchmarks, and performing other activities with the goal of creating a Japanese-made LLM with 175 billion parameters. At the same time, it is also building an LLM for R&D use.
At the LLM R&D center, progress is being made on extracting text data from Japanese-language academic papers in PDF form. Extracting text from these academic paper PDFs requires pre-processing such as analyzing their layout (text flow) and their structure (inferring the areas containing the main bodies of the papers). Many of the current tools that provide these functions were tuned for use on English-language papers, so there is a need for versatile, general purpose tools that can extract text from all kinds of Japanese-language papers, not just those of specific academic journals.
Morpho AIS was contracted by the LLM R&D center to develop AI-OCR functions for identifying the layouts unique to Japanese academic papers and extracting the text from their main bodies. This will contribute to the generation of a large amount of high-quality Japanese-language text data, which is essential for the creation of an LLM in Japan.
OCR is essential for converting documents stored in image form into text, but most of the OCR products on the market were developed for use with forms such as billing statements or receipts. Japanese documents have diverse layouts (using vertical writing, horizontal writing, and multiple columns) and have a mix of character types. This has made it difficult to accurately extract Japanese text, including its sequential reading order, using commercial OCR products.
The AI-OCR output service provided by Morpho AIS is capable of high accuracy text generation, including correctly identifying text reading order, something that commercial OCR products struggle with. Organizations can therefore use their scanned image data to generate diverse and accurate Japanese-language data sets, assisting with the creation of training data for Japanese-language LLMs.
Digitalization of existing documents (company histories, public relations magazines, public records, meeting minutes, etc.) and conversion into LLM training data
1: AI-OCR that supports a wide range of documents, not just forms
- Reproduces the reading order, which is important for LLM input
- Supports roughly 7,000 characters and can read even highly difficult kanji characters
2: Can output text (in various formats) from miscellaneous documents containing images (JPEGs, PDFs, PNGs)
This service is already being used to generate text in various organizations, including the National Diet Library.
(Tomigusuku City in Okinawa Prefecture, University of Bologna, Juntendo University, Shiga Prefectural Library, etc.)
FROG AI-OCR is a single package that combines the high resolution OCR processing of NDLOCR, which makes it easy to perform OCR, with correction and text output functions. All of its functions can be used via the cloud, enabling highly efficient confirmation and correction of output text. FROG AI-OCR uses NDLOCR (https://github.com/ndl-lab/ndlocr_cli), published by the National Diet Library under the CC BY license, as its core engine.
*1: Research and Development Center for Large Language Models Established at National Institute of Informatics
– Accelerating R&D to Develop Domestic LLMs and Ensure Transparency and Reliability of Generative AI Models –
12/19/2023
Morpho AI Solutions Launches Japanese-language Dataset Generation Service for LLMs
Morpho AI Solutions is a company engaged in the commercialization of AI (Artificial Intelligence). It promotes the introduction and actual operation of cutting-edge AI technologies, including AI-OCR, in the areas of social infrastructure such as government, electric power, transportation, and manufacturing.
For more information, visit https://www.morphoai.com/ or contact contact@morphoai.com.