Tokyo, Japan – December 19, 2023 – Morpho AI Solutions, Inc. (hereinafter “Morpho AIS”), which is responsible for the commercialization of AI within the Morpho Group, announced today that it has begun providing an AI-OCR output service for generating Japanese-language LLM training data.
This service provides highly accurate and diverse Japanese-language text data for organizations (such as companies, government agencies, and local governments) that are considering creating their own LLMs, as well as for AI companies and research institutions that develop LLMs.
Creating high-quality Japanese-language LLMs requires the collection of diverse Japanese-language data. However, most of the Japanese-language text data that can be easily collected dates from the 1990s onward, after the rise of the Internet. Many documents from before 1990 (such as company histories, public relations magazines, public records, and meeting minutes) have yet to be digitized, so this data cannot be collected efficiently. Many organizations creating LLMs therefore find themselves unable to collect diverse Japanese-language training data and must instead rely on publicly available, shared datasets. This limits their ability to create high-quality LLMs.
OCR is essential for digitizing archived documents, but the majority of the OCR products on the market were developed for use with billing statements, receipts, and other forms. Japanese documents have diverse layouts (vertical writing, horizontal writing, and multiple columns) and mix several character types. This has made it difficult for commercial OCR products to extract Japanese text accurately and in the correct reading order.
The OCR output service provided by Morpho AIS is capable of high-resolution text generation, including correct identification of the text reading order, something that commercial OCR products struggle with. Organizations can therefore use their scanned image data to generate diverse and accurate Japanese-language datasets, which helps them create training data for Japanese-language LLMs.
Digitization of existing documents (company histories, public relations magazines, public records, meeting minutes, etc.) and conversion into LLM training data
1: AI-OCR that supports a wide range of documents, not just forms
– Reproduces the reading order, which is important for LLM input
– Supports roughly 7,000 characters and can read even highly difficult kanji characters
2: Can output text (in various formats) from miscellaneous documents containing images (JPEGs, PDFs, PNGs)
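To illustrate why reading order matters for vertically written, multi-tier Japanese pages, the following is a minimal sketch. It is not the method used by FROG AI-OCR or NDLOCR; the `TextLine` type, the `order_vertical_lines` function, and the `tier_tolerance` parameter are hypothetical names introduced only for this example. It assumes each detected line carries the coordinates of its bounding box, groups lines into horizontal tiers, and reads each tier from right to left.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    """A detected vertical text line (hypothetical structure for illustration)."""
    text: str
    x: float  # horizontal center of the line's bounding box
    y: float  # top edge of the line's bounding box


def order_vertical_lines(lines, tier_tolerance=20.0):
    """Order vertical lines: tiers top to bottom, lines right to left within a tier."""
    tiers = []
    for line in sorted(lines, key=lambda l: l.y):
        # A line joins the current tier if its top edge is close to the tier's first line.
        if tiers and abs(line.y - tiers[-1][0].y) <= tier_tolerance:
            tiers[-1].append(line)
        else:
            tiers.append([line])

    ordered = []
    for tier in tiers:
        # Vertical Japanese text is read from the rightmost line to the leftmost.
        ordered.extend(sorted(tier, key=lambda l: l.x, reverse=True))
    return ordered
```

Real pages mix vertical and horizontal text, captions, and images, so a production system needs far more than this geometric heuristic; the sketch only shows how a naive left-to-right sort would scramble the text that an LLM is trained on.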
This service is already being used to generate text at various organizations, including the National Diet Library, Tomigusuku City in Okinawa Prefecture, the University of Bologna, Juntendo University, Shiga Prefectural Library, and large newspaper companies.
https://frog-ai-ocr.morphoai.com/
A free trial is also available from this page.
FROG AI-OCR is a single package that makes OCR easy to perform by combining the high-resolution OCR processing of NDLOCR with correction and text output functions. All of its functions can be used via the cloud, enabling highly efficient confirmation and correction of output text. FROG AI-OCR uses the National Diet Library’s NDLOCR (https://github.com/ndl-lab/ndlocr_cli) as its core engine.
Morpho AI Solutions is a company engaged in the commercialization of AI (Artificial Intelligence). It promotes the introduction and actual operation of cutting-edge AI technologies, including AI-OCR, in the areas of social infrastructure such as government, electric power, transportation, and manufacturing.
For more information, visit https://www.morphoai.com/ or contact contact@morphoai.com.