Dear Colleagues,

 

Site updates:

Added

  1. Կորպուսային լեզվաբանություն. Ներածություն - added summaries for more lectures.

  2. I asked Perplexity to:

    1. write a scientific essay, Tree banks usage in linguistics

    2. translate What is language? into Russian: Что такое язык?

    3. explain Armenian classic-revised orthography conversion

  3. Added the User's Guide to OCR Data Pipeline, which describes the functionality of the OCR system from the end-user perspective, as well as my experience producing text from the 4 volumes of Ačaryan's dictionary. The system is production ready (and worthy: "believe me" (D.J.Trump)).

  4. Completed OCR conversion of all 4 volumes of Ačaryan's dictionary. Started manual editing of volume 1, which appears to be the most challenging because it has several types of pages that OCR does not handle well; details are in the User's Guide to OCR Data Pipeline.

 

Site news:

  1. Renamed the Running Text Processor page to Running the Armenian Parser, because almost everything on this site is a text processor -:). Positioning the program as a spell checker, which it definitely is, was also confusing. I'll change the name of the program itself (application, .jar) later (no time now). The "official" name of the program is now the Eastern Armenian Text Parser. It takes Armenian text and produces a sequence of tagged lexemes. Lexemes that cannot be tagged are considered non-Armenian, that is, misspelled. Spell checking is thus a byproduct of parsing/tagging.

  2. Renamed Corpus Data Pipeline to OCR Data Pipeline. This is a more accurate name, because it reflects the functionality that the system provides rather than what it can be used for. You can use it to build corpora, thesauri, electronic books, etc.

  3. Updated the OCR Data Pipeline and Administering Text Processors. Implemented automatic:

    1. Image processing: spectrogram (spectrum histogram) creation, image rotation

    2. Page tilt detection

  4. I find the podcasts listed below very interesting. Creating similar materials for Armenian, possibly in written form (book, blog, article), could spark young people's interest in linguistics in Armenia: more of them would come to universities to become linguists and collaborate with me on the Thesaurus (Գանձարան), corpora building, and the Workbench implementation (I don't believe in people older than 26 anymore -:)). The podcasts:

    1. Why do we say "hello"?

    2. Can a word be its own opposite? | CONTRONYMS

    3. Did the Greeks have no word for blue? COLOR WORDS

    4. Check out the other videos too; I learned a lot

  5. For about a year I have been trying to get in contact with a representative of the Science and Education committee of the Armenian government. Any help would be greatly appreciated.

  6. Both ChatGPT and Perplexity have switched to GPT-4. ChatGPT is more specific: it uses GPT-4o mini.

 

Collaboration:

I am looking for help (volunteers, for now) with the following:

  1. Review and compare the technologies listed on the Հումանիտար տեխնոլոգիաների կայքէջեր page.

  2. A UX designer for this site (ՀԹՏ), as well as editors, creators, and contributors. I envisioned it as a collaboration platform rather than a site for promoting my Բնական խոսքի ընդհանրական ներկայացման մի տարբերակի մասին or other books.

  3. Compute and storage infrastructure (with DevOps and DataOps engineers) for services to support the Linguists'/Philologists'/Lexicographers' Workbench.

  4. Investigate phases of [Armenian] language produced by neural networks (GPT) and verify whether "poverty of input" applies to neural nets. For more details, see the Why GPT is not a language model? section in What is language? and Armenian Corpora and morphology test.

  5. Edit Armenian Wikipedia articles: we can make it a valuable knowledge base for laymen, students, and researchers alike. Let me know if you know of a forum or hangout for Armenian Wikipedia activists (enthusiasts).

  6. Work on introducing generative AI (GPT) into the Armenian science and education system: prepare courses on the basics of AI usage and prompt engineering.

  7. Linguists and software engineers for developing and supporting:

    1. Running the Armenian Parser

    2. The OCR Data Pipeline

    3. Գանձարան

    4. Linguists'/Philologists'/Lexicographers' Workbench (see section 16 in Բնական խոսքի ընդհանրական ներկայացման մի տարբերակի մասին and Lecture #4 in Կորպուսային լեզվաբանություն. Ներածություն, as well as the diagram in Թվային հումանիտար գիտություններ). A Workbench that provides access to Գանձարան and to the Armenian corpora (item 7.5 below) would be an important scientific and educational tool (much more important than a Statue of Jesus or a Scientific Town). It can put [experimental] linguists on an equal footing with [experimental] physicists: the Workbench + Գանձարան + Armenian corpora can become the LHC or JWST of linguistics. This is particularly important for Armenian linguistics, because theoretical linguistics may be in good shape (mostly due to the work of Jahukyan & Co. in the 60s-80s), while contemporary experimental linguistics is practically at zero, despite a couple of successful corpus implementations. Besides, the opinion that "theoretical linguistics is in good shape" might itself be shattered (maybe not that dramatically -:)) by corpus research.

    5. Armenian corpora:

      1. Armenian periodicals and newspapers (pre-Soviet, Soviet, post-Soviet, post-Independence) corpus

      2. Armenian oral tradition corpus (philologists and ethnographers (cultural anthropologists?) should be involved)

      3. Armenian current spoken dialects corpus.

      4. Yerevan (Gyumri, Vanadzor, etc.) dialect daily spoken yearly snapshots corpus

      5. Armenian TV, radio (Soviet, post-Soviet, post-Independence), podcasts corpus. A Section: Cinema, Theatre including TV, radio productions corpus.

      6. Armenian non-fictional (scientific, scholastic) corpus

      7. Armenian song lyrics corpus (folk, classic, Soviet, rabis, hip-hop)

      8. Kurdish (Ezidi) oral and written (Riya Taza) corpora

      9. Lomavren (language of the Armenian Lom/Roma people) corpus

      10. Caucasian Albanian language corpus

      11. South Caucasus Russian periodicals corpus

      12. South Caucasus and Russia's Armenian periodicals corpus

      13. South Caucasus and Russia's Russian-language Armenian periodicals corpus

      14. EU and USA Armenian periodicals corpus

      15. Middle Eastern (+Türkiye, +Persia) Armenian periodicals corpus

      16. Turkish literary, non-fiction, periodicals (including written in Armenian letters) corpus

      17. Azerbaijani literary, non-fiction, periodicals (pre-Soviet, Soviet, post-Soviet) corpus

      18. Persian literary, non-fiction, periodicals corpus

 

I would appreciate your help in finding paying customers (I already have some non-paying ones -:)) for OCR projects that use the OCR Data Pipeline. Possible interested parties are:

  1. Libraries: National, Academy of Sciences, University, Local, etc.

  2. Publishing houses

  3. Archives

  4. Courtrooms

  5. Individuals who want to digitize old, valuable books in Armenian and translate them with GPT-4o into, for example, Kiswahili, to spread Armenian thought in Africa (I can also help with publishing and distribution: Agoulis).

Disclaimer:

If I have a commercial (or any other) interest in the applications or resources that I recommend, I declare it explicitly.

Site Map:

ՀԹՏ Home page

+ Թվային հումանիտար գիտություններ

+ Հումանիտար տեխնոլոգիաների կայքէջեր

+ Հայոց ՀԹՏ իրացումներ

- Կորպուսային լեզվաբանություն 

*_Բնական խոսքի ընդհանրական ներկայացման մի տարբերակի մասին

** Չհրապարակված հատվածներ

^ Խոսքի նկարագրությունը Համընդհանուր Կախվածություններով

] Լեզվաբանություն, ԱԲ, եւ ՀԹՏ

# ՆԱՅԻՐԻ բառարաններ

$ Հ.Աճառյան. ՀԱԲ. ՅԱՌԱՋԱԲԱՆ

) Գանձարան

 

Pages in English

= Interviewing ChatGPT

== 1. Language basics

== 2. Natural Language Processing (NLP)

== 3. Corpus linguistics

== 4. Armenian Corpora and morphology test  

? Reviewing the Thought-Based Linguistics

! What is language?

*_On Syntactic Structure Representation

} Running the Armenian Parser - Linux command line Eastern Armenian Parser (spellcheck, tagging)

+ Armenian Characters set review

~ OCR Data Pipeline - User's Guide to OCR Data Pipeline.

 

Pages in Russian:

  1. Что такое язык?
