GitHub Repo: Archy the Anki bot (吖奇说Anki助理)
The Idea 想法
It’s been almost a year since I moved back to China, and I’m still struggling with Chinese.
Unlike English, Chinese has no spaces between words. Figuring out the proper word segmentation of a sentence can often be a mind-numbing task for Chinese learners, especially when the sentence contains words and characters that one is not familiar with.
It has occurred to me that many Chinese learners, including myself, would be able to perform word segmentation more efficiently if we could preview the difficult words in each paragraph beforehand (i.e. Chinese words we are likely not familiar with) and have each word annotated with its pinyin and a rough definition. This would also improve the whole reading experience.
And it would be even nicer if there were a simple procedure that entered everything we need to remember (i.e. the words we are not familiar with, together with their pinyin and definitions) into a system like Anki, where we can later perform active recall and spaced repetition to develop long-term memory for these words efficiently.
● ● ●
After many burnouts and failures (which included screwing up my MiraclePlus interview) and realising the video editor project that I was working on was not going anywhere, I decided I wanted to work on a chatbot assistant that can help me to be more productive learning Chinese. And maybe others will find it to be useful as well =)
And Archy the Anki Bot 0.0.1 was born.
The Use Cases 用例
1: Extract difficult Chinese words from WeChat articles.
2: Annotate Chinese words with pinyin and rough definitions (expressed in English).
3: Generate a deck of Anki notes from Chinese words.
Design & Implementation & Demo 设计与履行与演示
Basically we would have an ArticleAnalysor, a TextAnalysor, a Lexicographer, and an AnkiDeckGenerator. And we would integrate everything in main.ts, where we handle WeChaty callbacks.
For the current use cases, we would use the ArticleAnalysor to extract text from the WeChat article, the TextAnalysor to tokenise the text into words (using jieba with a pretrained model), and the Lexicographer to assign a difficulty score to each word (using an ad hoc formula with Chih-Hao’s Chinese character metadata), as well as to give English definitions and pinyin to selected words (using CC-CEDICT). Lastly, the AnkiDeckGenerator generates a deck of Anki notes (using genanki).
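To make the Lexicographer step a bit more concrete, here is a minimal Python sketch of how scoring and annotation could fit together. It is not Archy's actual code: the `CHAR_RANK` table, the `UNKNOWN_RANK` fallback, the "rarest character" formula and the threshold are all made-up assumptions for illustration, and the sample lines only follow the CC-CEDICT line format (`Traditional Simplified [pinyin] /definition/.../`).

```python
import math
import re
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple

# CC-CEDICT line format: Traditional Simplified [pin1 yin1] /def 1/def 2/
CEDICT_LINE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


class Entry(NamedTuple):
    traditional: str
    simplified: str
    pinyin: str
    definitions: List[str]


def parse_cedict_line(line: str) -> Optional[Entry]:
    """Parse one line; return None for comments, blanks and malformed lines."""
    line = line.strip()
    if line.startswith("#"):
        return None
    match = CEDICT_LINE.match(line)
    if match is None:
        return None
    trad, simp, pinyin, defs = match.groups()
    return Entry(trad, simp, pinyin, defs.split("/"))


def build_dictionary(lines: Iterable[str]) -> Dict[str, Entry]:
    """Index parsed entries by their simplified form."""
    entries = (parse_cedict_line(line) for line in lines)
    return {e.simplified: e for e in entries if e is not None}


# Assumption: illustrative frequency ranks (1 = most common), NOT real data.
CHAR_RANK = {"学": 60, "习": 700, "蝴": 3200, "蝶": 2900}
UNKNOWN_RANK = 10000  # assumption: characters missing from the table count as rare


def difficulty(word: str) -> float:
    """Hypothetical formula: log10 of the rank of the rarest character."""
    return max((math.log10(CHAR_RANK.get(ch, UNKNOWN_RANK)) for ch in word), default=0.0)


def annotate(words: Iterable[str], dictionary: Dict[str, Entry],
             threshold: float = 3.0) -> List[Tuple[str, str, str]]:
    """Keep words at or above the threshold and attach pinyin and definitions."""
    notes = []
    for word in words:
        entry = dictionary.get(word)
        if entry is not None and difficulty(word) >= threshold:
            notes.append((word, entry.pinyin, "; ".join(entry.definitions)))
    return notes


# Sample lines in CC-CEDICT format (illustrative, not a verbatim excerpt).
SAMPLE = """\
# sample dictionary
學習 学习 [xue2 xi2] /to learn/to study/
蝴蝶 蝴蝶 [hu2 die2] /butterfly/
"""

dictionary = build_dictionary(SAMPLE.splitlines())
print(annotate(["学习", "蝴蝶"], dictionary))
# [('蝴蝶', 'hu2 die2', 'butterfly')]
```

With these made-up ranks, 学习 scores below the threshold and is skipped, while 蝴蝶 is kept. Each resulting tuple (word, pinyin, definition) is the kind of record an AnkiDeckGenerator could turn into a note.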
Gluing everything together functionally, this is what we get:
What’s Next? 接下来呢？
I’m still in the midst of planning, but here are some rough ideas:
As we can see, the ad hoc word difficulty scoring formula isn’t performing very well at the moment. That is something I need to experiment with, perhaps by scraping some text and combining BERT with a self-trained model, to achieve a more accurate scoring system.
jieba works well in general, but it may still give unsatisfying results (e.g. when a sentence contains a person’s name). Trying out different models aside, my plan is to engineer around the problem (i.e. to have results that always make sense to the users) using tools like StanfordNLP’s stanza, or to approach the problem differently.
I’m also thinking about extending the Lexicographer to include definitions from different dictionaries, as well as online search results that are useful to language learners.
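To illustrate the segmentation issue with personal names mentioned above, here is a small Python sketch. It uses forward maximum matching, a simple greedy dictionary-based technique, as a stand-in; it is not how jieba works internally, and the vocabulary and sentence are made up. It only shows why a name that is missing from the vocabulary gets split, and how adding it as a custom entry changes the result (jieba offers user dictionary entries, e.g. `jieba.add_word`, which may serve a similar purpose).

```python
from typing import List, Set


def forward_max_match(text: str, vocabulary: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation: always take the longest known word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Assumption: a tiny made-up vocabulary, for illustration only.
BASE_VOCAB = {"喜欢", "学习", "中文", "小明"}
sentence = "李小明喜欢学习中文"

print(forward_max_match(sentence, BASE_VOCAB))
# ['李', '小明', '喜欢', '学习', '中文']

print(forward_max_match(sentence, BASE_VOCAB | {"李小明"}))
# ['李小明', '喜欢', '学习', '中文']
```

Without the name in the vocabulary, the surname is split off as a lone character. A list of names (for example, collected by a named-entity recogniser) could be fed in the same way before segmentation, though whether that is enough in practice is something that would need testing.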
MiniApp & Premium Version & The Future 小程序与会员版与未来打算
Anki is an amazing and very powerful tool, but I feel it is too exam-orientated, in the sense that it is best utilised by people (e.g. medical students) aiming to do well in an upcoming exam. From a UI/UX perspective it also has a steep learning curve. I’m currently working on a WeChat and TikTok MiniApp inspired by Anki but with a more laid-back take on it. The end product will be a nicely designed tool for people who want to improve their Chinese with the intention of reading and speaking better rather than scoring well in exams. The premium version would come with a chatbot assistant like Archy the Anki bot.
Archy the Anki bot will always remain free and open-source on GitHub. I will continue to improve it as I work on the commercial aspect of the project described above, so that I can continue doing this full-time and maybe it can become ramen profitable. 🍜 🍜 🍜
If things go well, I would like to scale it up to cover other languages (e.g. English, Japanese, German), as well as to go beyond language learning and become a full-fledged note-taking productivity tool for autodidacts. It will be like Notion but more for remembering stuff and visualising knowledge representation. And at the core of it would be a cross-platform chatbot assistant* =) At the moment I’m reading up on how to train a model to do handwritten diagram recognition (e.g. mind maps, UML, flow charts) as well as looking into visual languages like Chalktalk. ⚗️ ⚗️ ⚗️
*: in general, from a product perspective, I believe a chatbot is a great I/O interface to the world, especially as social media apps become the new browsers.
Huge thanks to the WeChaty community and everyone involved in making WeChaty such a wonderful lib! And the Juzi.bot team for opening up their padplus protocol ecosystem for outsiders like me!
If you are interested in the development of this project feel free to follow Archy.sh on WeChat and TikTok or join our mailing list =)
Also please feel free to fork my repo, deploy your own bot, or just do anything with the code, or open issues if there are any! Thanks!
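p.s. I got a bit tired and lazy partway through writing the Chinese version 🥴 Once the「吖奇说记忆卡片」MiniApp is live, more about where things are heading (in Chinese and English) will be posted on my WeChat official account. If you are interested, feel free to follow my official account and Douyin @吖奇说.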