Dear Colleagues,

Site updates:
Added Կորպուսային լեզվաբանություն. Ներածություն (Corpus Linguistics: An Introduction) - added summaries for more lectures.
I asked Perplexity to:
  write a scientific essay: Tree banks usage in linguistics
  translate What is language? into Russian: Что такое язык?
  explain Armenian classic-revised orthography conversion
Added the User's Guide to OCR Data Pipeline, where I describe the functionality of the OCR system from the end-user perspective, as well as my experience producing text from the four volumes of Ačaryan's dictionary. The system is production-ready (and worthy: "believe me" (D.J.Trump)).
Completed OCR conversion of all four volumes of Ačaryan's dictionary. Started manual editing of volume 1, which appears to be the most challenging, since it has several types of pages that OCR does not handle well; details are in the User's Guide to OCR Data Pipeline.
Site news:
Renamed the Running Text Processor page to Running the Armenian Parser. I'll change the name of the program (application, .jar) later on (I have no time now). The old name was confusing, because almost everything on this site is a Text Processor -:). Positioning the program as a spell checker - which it definitely is - was also confusing. The "official" name of the program is now the Eastern Armenian Text Parser. It takes Armenian text and produces a sequence of tagged lexemes. Lexemes that cannot be tagged are considered non-Armenian, i.e. misspelled, so spell checking is a byproduct of parsing/tagging.
Renamed Corpus Data Pipeline to OCR Data Pipeline. This is a more accurate name, because it reflects the functionality that the system provides rather than what it can be used for. You can use it to build corpora, thesauri, electronic books, etc.
Updated the Corpus Data Pipeline and the Administering Text Processors.
Implemented automatic:
  Image processing: spectrogram (spectrum histogram) creation, image rotation
  Page tilt detection,
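To illustrate the description of the Eastern Armenian Text Parser above (spell checking as a byproduct of tagging), here is a minimal sketch in Python. It is not the parser's actual implementation: the lexicon, the tag names, and the function names `tag_text` and `find_untagged` are all hypothetical, and a plain dictionary lookup stands in for whatever morphological analysis the real program performs.

```python
import re

# Hypothetical toy lexicon (word form -> tag). The real parser's lexicon and
# tag set are not described in this newsletter; these entries are assumptions.
TOY_LEXICON = {
    "ես": "PRON",
    "գիրք": "NOUN",
    "եմ": "AUX",
    "կարդում": "VERB",
}

# Runs of Armenian letters: U+0531..U+0556 (uppercase), U+0561..U+0587 (lowercase).
# Text in other scripts is simply skipped by this sketch.
ARMENIAN_WORD = re.compile(r"[\u0531-\u0556\u0561-\u0587]+")


def tag_text(text):
    """Return (token, tag) pairs; tag is None when the token cannot be tagged."""
    return [(tok, TOY_LEXICON.get(tok.lower())) for tok in ARMENIAN_WORD.findall(text)]


def find_untagged(text):
    """Spell checking as a byproduct: report the tokens that received no tag."""
    return [tok for tok, tag in tag_text(text) if tag is None]


if __name__ == "__main__":
    sample = "Ես գիրքք եմ կարդում"  # "գիրքք" is a deliberate misspelling of "գիրք"
    for token, tag in tag_text(sample):
        print(token, tag or "UNTAGGED")
```

In this sketch no separate spell-checking pass exists: the list of suspected misspellings is simply the set of tokens the tagger could not label, which mirrors the relationship described above.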
I find these podcasts very interesting. Creating similar materials for Armenian - possibly written ones (a book, blog, or article) - could spark interest in linguistics among young people in Armenia: more youngsters would come to universities to become linguists and collaborate with me on the Thesaurus (Գանձարան), corpora building, and the Workbench implementation (I don't believe in people older than 26 anymore -:)):
  Why do we say "hello"?
  Can a word be its own opposite? | CONTRONYMS
  Did the Greeks have no word for blue? COLOR WORDS
Check out the other videos - I learned a lot.
For about a year I have been trying to get in contact with a representative of the Science and Education committee of the Armenian government. Any help is greatly appreciated.
Both ChatGPT and Perplexity switched to GPT-4. ChatGPT is more specific: it uses GPT-4o mini.
Collaboration: I am looking for help (for now - volunteers) with the following:
Review and compare the technologies listed on the Հումանիտար տեխնոլոգիաների կայքէջեր (Humanities Technology Web Pages) page.
A UX designer for this site (the ՀԹՏ), as well as editors/creators/contributors. I envision it as a collaboration platform rather than a site for promoting my Բնական խոսքի ընդհանրական ներկայացման մի տարբերակի մասին or other books.
Compute and storage infrastructure (with DevOps and DataOps engineers) for services supporting the Linguists'/Philologists'/Lexicographers' Workbench.
Investigate phases of [Armenian] language produced by neural networks (GPT) and verify whether "poverty of input" is applicable to neural nets - for more details see the Why GPT is not a language model? section in What is language? Armenian Corpora and morphology test.
Edit Armenian Wikipedia articles: we can make it a valuable knowledge base for laymen, students, and researchers alike. Let me know if you know of a forum or a hangout for Armenian Wikipedia activists (enthusiasts).
Work on introducing generative AI (GPT) into the Armenian science and education system: prepare courses on AI usage basics and prompt engineering.
Linguists and software engineers for developing and supporting:
  Running the Armenian Parser
  The OCR Data Pipeline
  Գանձարան
  Linguists'/Philologists'/Lexicographers' Workbench (see section 16 in Բնական խոսքի ընդհանրական ներկայացման մի տարբերակի մասին and Lecture #4 in Կորպուսային լեզվաբանություն. Ներածություն, as well as the diagram in Թվային հումանիտար գիտություններ (Digital Humanities)).

A Workbench that provides access to Գանձարան and the Armenian corpora listed below is an important scientific and educational tool (much more important than a Statue of Jesus or a Scientific Town). It can put [experimental] linguists on an equal footing with [experimental] physicists: the Workbench + Գանձարան + Armenian corpora can become the LHC or JWST of linguistics.
It is particularly important for Armenian linguistics: theoretical linguistics might be in good shape (mostly due to the works of Jahukyan & Co in the 1960s-1980s), while contemporary experimental linguistics is practically zero, despite a couple of successful corpus implementations. Besides, the opinion that "theoretical linguistics is in good shape" might itself be shattered (maybe not that dramatically -:)) by corpus research.

Armenian corpora:
  Armenian periodicals and newspapers (pre-Soviet, Soviet, post-Soviet, post-Independence) corpus
  Armenian oral tradition corpus (philologists and ethnographers (cultural anthropologists?) should be involved)
  Armenian current spoken dialects corpus
  Yerevan (Gyumri, Vanadzor, etc.) dialect daily spoken yearly snapshots corpus
  Armenian TV, radio (Soviet, post-Soviet, post-Independence), and podcasts corpus. A section: cinema and theatre, including TV and radio productions, corpus
  Armenian non-fictional (scientific, scholastic) corpus
  Armenian song lyrics: folk, classic, Soviet, rabis, hip-hop
  Kurdish (Yezidi) oral and written (Riya Taza) corpora
  Lomavren (the language of the Armenian Lom/Roma people) corpus
  Caucasian Albanian language corpus
  South Caucasus Russian periodicals corpus
  South Caucasus and Russia's Armenian periodicals corpus
  South Caucasus and Russia's Russian-language Armenian periodicals corpus
  EU and USA Armenian periodicals corpus
  Middle Eastern (+Türkiye, +Persia) Armenian periodicals corpus
  Turkish literary, non-fiction, and periodicals (including those written in Armenian letters) corpus
  Azerbaijani literary, non-fiction, and periodicals (pre-Soviet, Soviet, post-Soviet) corpus
  Persian literary, non-fiction, and periodicals corpus
I would appreciate your help in finding paying customers (I have some non-paying ones -:)) for OCR projects using the OCR Data Pipeline. Possible interested parties are:
  Libraries: National, Academy of Sciences, university, local, etc.
  Publishing houses
  Archives
  Courtrooms
  Individuals who want to digitize old, valuable books in Armenian and translate them, for example, into Kiswahili with GPT-4o, to spread Armenian thought in Africa (I can also help with publishing and distribution - Agoulis).
Disclaimer: If I have a commercial (or any other) interest in the applications or the resources that I recommend, then I explicitly declare it.