Chinese Words to Memorize

Learning Chinese is difficult, and memorizing a new word comes at a price (it requires independently memorizing the characters, the pinyin, and the tone!). Recently I tried watching (with excessive pausing) the TV series 微微一笑, and with each new word I encountered I asked myself: "Is this word worth memorizing?"

What the hell is 眨眼, and should I memorize it?

It's not straightforward to answer. Examining the word's characters doesn't help: a character with many strokes is usually no harder to recognize than a simple one. At least for me, the cost of memorizing each word is the same.

But when looking at all the words together, we can estimate each word's frequency, which measures how likely I am to encounter that word again (and how much less of a waste of time learning it would be!). For example, in 微微一笑 words about computer science appear relatively frequently, so it pays off to memorize those. So my first instinct was to take all the subtitle files of the show, calculate n-gram histograms, and only learn the most frequent words. But after a bit of googling I realized two things:

  1. This show has no subtitle files available yet...
  2. It has been done academically on a much larger scale already!
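For reference, the n-gram counting step I had in mind is tiny. Here's a sketch over raw characters (real subtitles would first need word segmentation, which I'm glossing over):

```python
from collections import Counter

def ngram_counts(text, n):
    # Count overlapping character n-grams in a string.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Bigram counts for a toy subtitle line
counts = ngram_counts(u'微微一笑很倾城微微一笑', 2)
```

Sorting `counts.most_common()` then gives the learning order by frequency.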

It's all about the Corpus

Frequency depends on the body of text used (a.k.a. the corpus). Cai & Brysbaert 2010 used a massive dataset of TV series and movie subtitles as a corpus, and they were even nice enough to share the data online. Their methodology appears sound enough for me to ignore the fact that it uses a broad database. Even so, looking at the data I get a sense that it represents modern TV much better than the HSK rankings do.

Making a Memorization List

Now I can make a list of Chinese words to memorize:

  1. Rank top words by their frequency
  2. Show the English definitions for each frequent word
  3. Exclude words I already know

So the rest of this notebook does just that.

In [1]:
%matplotlib inline
from pylab import *

Frequency Data

Let's download the SUBTLEX-CH data and parse it. I'll be using Pandas, because it's the best tool for combining tabular data.

In [2]:
import requests, StringIO, zipfile

def download_zip(url):
    # Download a zip file into memory and open it
    return zipfile.ZipFile(
        StringIO.StringIO(requests.get(url, stream=True).content))
In [3]:
freq_zip = download_zip('')
In [4]:
import pandas
with freq_zip.open('SUBTLEX-CH-WF.xlsx') as freq_xlsx:
    freq_df = pandas.read_excel(freq_xlsx, header=2, index_col=0)
freq_df = freq_df.sort('WCount', ascending=False)
freq_df['rank'] = range(1, len(freq_df)+1) # Rank is useful
pandas.options.display.max_rows = 10
WCount W/million logW W-CD W-CD% logW-CD rank
1682530 50155.13 6.2260 6243 100.00 3.7954 1
1682285 50147.83 6.2259 6242 99.98 3.7953 2
1329424 39629.27 6.1237 6242 99.98 3.7953 3
947807 28253.52 5.9767 6243 100.00 3.7954 4
946365 28210.53 5.9761 6243 100.00 3.7954 5
... ... ... ... ... ... ... ...
让奇南 1 0.03 0.0000 1 0.02 0.0000 99117
官子 1 0.03 0.0000 1 0.02 0.0000 99118
贝艾迪 1 0.03 0.0000 1 0.02 0.0000 99119
臂腕 1 0.03 0.0000 1 0.02 0.0000 99120
外行话 1 0.03 0.0000 1 0.02 0.0000 99121

99121 rows × 7 columns

So how frequent is that word I didn't know?

In [5]:
freq_df[freq_df.index == u'眨眼']
WCount W/million logW W-CD W-CD% logW-CD rank
眨眼 243 7.24 2.3856 162 2.59 2.2095 6646

Not frequent at all; I can safely ignore it.

Now, with almost 100,000 words, I have to somehow determine the most cost-effective number of words to learn. Let's check the distribution of words:

In [6]:

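The plot cell didn't survive the export; a minimal sketch of an equivalent histogram, using synthetic Zipf-like counts standing in for the real freq_df['WCount']:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# Synthetic Zipf-like word counts (illustrative, not the real SUBTLEX data):
# count ~ top_count / rank, for ~99k word types
ranks = np.arange(1, 99122)
wcounts = np.round(1682530.0 / ranks)

plt.hist(wcounts, bins=50, log=True)
plt.xlabel('word count')
plt.ylabel('number of words')
plt.savefig('wcount_hist.png')
```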
What an extreme distribution! Let's see the CDF:

In [7]:

So by memorizing the 1000 most frequent words, I should on average know 4 out of every 5 words, which is often enough to understand a sentence from context. So we are left with:
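The 4-out-of-5 figure comes from the CDF of token counts: cumulative count of the top-k words divided by the total. A sketch of that computation, again with synthetic Zipf-like counts in place of the real freq_df['WCount']:

```python
import numpy as np

# Zipf-like token counts for ~99k word types, sorted by rank (illustrative only)
counts = 1.0 / np.arange(1, 99122)

# Fraction of all tokens covered by the top-k words, for every k
coverage = np.cumsum(counts) / counts.sum()

top1000_coverage = coverage[999]  # coverage of the 1000 most frequent words
```

With the real data the top-1000 coverage comes out around 80%; a pure 1/rank law undershoots that, which is part of why checking the actual CDF matters.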

In [8]:
freq_top_df = freq_df[freq_df['rank']<=1000]
WCount W/million logW W-CD W-CD% logW-CD rank
1682530 50155.13 6.2260 6243 100.00 3.7954 1
1682285 50147.83 6.2259 6242 99.98 3.7953 2
1329424 39629.27 6.1237 6242 99.98 3.7953 3
947807 28253.52 5.9767 6243 100.00 3.7954 4
946365 28210.53 5.9761 6243 100.00 3.7954 5
... ... ... ... ... ... ... ...
3031 90.35 3.4816 944 15.12 2.9750 996
3030 90.32 3.4814 1452 23.26 3.1620 997
有意思 3029 90.29 3.4813 1894 30.34 3.2774 998
3025 90.17 3.4807 1712 27.42 3.2335 999
使用 3017 89.93 3.4796 1784 28.58 3.2514 1000

1000 rows × 7 columns

Word Definitions

Luckily there's a Creative Commons Chinese-English dictionary available (CC-CEDICT). It's community-based and updated regularly. Its format is a bit strange, but I wrote this parser:

In [9]:
import re

def parse_ce_dict():
    ce_zip = download_zip('')
    for line_ascii in ce_zip.open("cedict_ts.u8", 'r'):
        if line_ascii.startswith('#'):
            continue  # skip header comments
        line = line_ascii.decode('utf-8')
        (hanzi, pinyin, definitions) = re.findall('[^ ]+ ([^ ]+) \[(.+)\] /(.*)/', line)[0]
        yield (hanzi, pinyin, definitions)

ce_df = pandas.DataFrame(data=[x for x in parse_ce_dict()], 
                         columns=('hanzi', 'pinyin', 'definitions'))
hanzi pinyin definitions
0 % pa1 percent (Tw)
1 21三体综合症 er4 shi2 yi1 san1 ti3 zong1 he2 zheng4 trisomy/Down's syndrome
2 3C san1 C abbr. for computers, communications, and consu...
3 3P san1 P (slang) threesome
4 A A (slang) (Tw) to steal
... ... ... ...
114830 he2 old variant of 和[he2]/harmonious
114831 xie2 to harmonize/to accord with/to agree
114832 yu4 variant of 籲|吁[yu4]
114833 xx5 component in Chinese characters, occurring in ...
114834 ging1 uptight/to awkwardly force oneself to do sth/(...

114835 rows × 3 columns
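For reference, each CC-CEDICT line has the shape `traditional simplified [pinyin] /def1/def2/`, and the regex in the parser keeps the simplified form. A self-contained example on one made-up line:

```python
import re

# One CC-CEDICT-format line: traditional, simplified, [pinyin], /definitions/
line = u'中國 中国 [Zhong1 guo2] /China/'
hanzi, pinyin, definitions = re.findall('[^ ]+ ([^ ]+) \[(.+)\] /(.*)/', line)[0]
# hanzi is the simplified form, pinyin the bracketed reading,
# definitions the slash-separated glosses
```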

Now we can check the meaning of that word

In [10]:
ce_df[ce_df.hanzi == u'眨眼']
hanzi pinyin definitions
72068 眨眼 zha3 yan3 to blink/to wink/in the twinkling of an eye

Sadly, with Chinese, a Hanzi string doesn't uniquely identify a meaning. That is, some words have several Pinyin readings, each with its own meanings. Take 了 for example:

In [11]:
ce_df[ce_df['hanzi'] == u'了']
hanzi pinyin definitions
4136 了 le5 (modal particle intensifying preceding clause)...
4137 了 liao3 to finish/to achieve/variant of 瞭|了[liao3]/to ...
72465 了 liao3 (of eyes) bright/clear-sighted/to understand c...
72466 了 liao4 unofficial variant of 瞭[liao4]

So 了 has 3 different pronunciations and 4 different dictionary entries. And this is not a special case; let's group the dictionary by Hanzi (aggregating Pinyin and definitions)

In [12]:
ce_hanzi_groups = ce_df.groupby('hanzi')

and look at the distribution

In [13]:
%matplotlib inline
from pylab import *
hist([len(x[1]) for x in ce_hanzi_groups], log=True);
xlabel('Different pronunciations');

Since we're dealing with word frequency, I want a dictionary that describes each Hanzi combination uniquely. Let's create one by joining the Pinyin and definitions of entries that share the same Hanzi:

In [14]:
ce_hanzi_df = pandas.DataFrame(
    columns=ce_df.columns,
    data=[(hanzi,
           ';'.join(ce_df['pinyin'][indices]),
           ';'.join(ce_df['definitions'][indices]))
          for hanzi, indices in ce_hanzi_groups.groups.iteritems()])
ce_hanzi_df.index = ce_hanzi_df['hanzi']
hanzi pinyin definitions
不学无术 不学无术 bu4 xue2 wu2 shu4 without learning or skills (idiom); ignorant a...
奸险 奸险 jian1 xian3 malicious/treacherous/wicked and crafty
生化学 生化学 sheng1 hua4 xue2 biochemistry
楼兰 楼兰 Lou2 lan2 ancient oasis town of Kroraina or Loulan on th...
永宁县 永宁县 Yong3 ning2 xian4 Yongning county in Yinchuan 銀川|银川[Yin2 chuan1]...
... ... ... ...
推服 推服 tui1 fu2 to esteem/to admire
谷氨酸钠 谷氨酸钠 gu3 an1 suan1 na4 monosodium glutamate (MSG) (E621)
丝路 丝路 Si1 Lu4 the Silk Road/abbr. for 絲綢之路|丝绸之路[Si1 chou2 zh...
烂泥 烂泥 lan4 ni2 mud/mire
盥洗室 盥洗室 guan4 xi3 shi4 toilet/washroom/bathroom/lavatory/CL:間|间[jian1]

111545 rows × 3 columns
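In newer pandas the same merge can be written with groupby/agg directly. A sketch on toy data, assuming the same three column names:

```python
import pandas as pd

df = pd.DataFrame({'hanzi': [u'了', u'了', u'好'],
                   'pinyin': ['le5', 'liao3', 'hao3'],
                   'definitions': ['(particle)', 'to finish', 'good']})

# One row per hanzi, with pinyin and definitions joined by ';'
merged = df.groupby('hanzi', as_index=False).agg(';'.join)
```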

Known Words

I use Hanping Chinese to keep track of all the words I know. One of its nice features is that it can export the list of starred words in a CSV-like format:

In [15]:
known_df = pandas.DataFrame.from_csv(r"Hanping Chinese Pro Starred Export (1).txt",
                                    parse_dates=False, sep='\t', header=-1, encoding='utf-8')
known_df = known_df.iloc[:-1] # Skip last line
known_df.columns = ('pinyin', 'definition')
known_df.index = pandas.Index([x.strip() for x in known_df.index])
known_df['hanzi'] = known_df.index
pinyin definition hanzi
宾馆 bīn guǎn guesthouse • lodge • hotel • CL: 个 (gè),家 (jiā) 宾馆
纸巾 zhǐ jīn paper towel • napkin • facial tissue • CL: 张 (... 纸巾
芋头 yù tou taro 芋头
马戏 mǎ xì circus 马戏
豆奶 dòu nǎi soy milk 豆奶
... ... ... ...
yóu swim; float • travel • rove • associate with •...
应该 yīng gāi should; ought to; must 应该
坏事 huài shì bad thing; evil deed • ruin sth.; make things ... 坏事
会作 huì zuò can do; know how to do (sth.) 会作
让一让 ràng yi ràng step aside 让一让

822 rows × 3 columns
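DataFrame.from_csv has since been removed from pandas; the modern equivalent is read_csv. A sketch on a fake two-row export in the same tab-separated shape:

```python
import io
import pandas as pd

# Fake starred-words export: hanzi <tab> pinyin <tab> definition
data = u'宾馆\tbīn guǎn\tguesthouse\n纸巾\tzhǐ jīn\tpaper towel\n'

known = pd.read_csv(io.StringIO(data), sep='\t', header=None,
                    names=['hanzi', 'pinyin', 'definition'], index_col=0)
```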

Putting it all together

Finally, we have everything we need to make the list:

In [16]:
freq_top_def_df = freq_top_df.join(ce_hanzi_df, rsuffix='_ce')[['pinyin', 'rank', 'definitions']]
freq_top_def_df.to_csv('frequent_1000.csv', encoding='utf-8')

freq_top_def_new_df = freq_top_def_df.loc[freq_top_def_df.index.difference(known_df.index)].sort('rank')
freq_top_def_new_df.to_csv('my_frequent.csv', encoding='utf-8')

pandas.options.display.max_rows = 20
pinyin rank definitions
一个 NaN 44 NaN
e2;o2;o4;o5 63 to chant;oh (interjection indicating doubt or ...
bei4 65 quilt/by/(indicates passive-voice clauses)/(li...
需要 xu1 yao4 101 to need/to want/to demand/to require/requireme...
hei1 109 hey
en1;en4;en5 128 (a groaning sound);(nonverbal grunt as interje...
o1 149 Oh!
事情 shi4 qing5 151 affair/matter/thing/business/CL:件[jian4],樁|桩[z...
也许 ye3 xu3 164 perhaps/maybe
hai1 170 oh alas/hey!/hi! (loanword)/high (on drugs, or...
... ... ... ...
tuo1 987 to shed/to take off/to escape/to get away from
pei2 988 to accompany/to keep sb company/to assist/old ...
失踪 shi1 zong1 989 to be missing/to disappear/unaccounted for
认真 ren4 zhen1 990 conscientious/earnest/serious/to take seriousl...
真相 zhen1 xiang4 991 the truth about sth/the actual facts
法律 fa3 lu:4 992 law/CL:條|条[tiao2], 套[tao4], 個|个[ge4]
训练 xun4 lian4 993 to train/to drill/training/CL:個|个[ge4]
值得 zhi2 de5 995 to be worth/to deserve
A1;a1;e1 996 abbr. for Afghanistan 阿富汗[A1 fu4 han4];prefix ...
使用 shi3 yong4 1000 to use/to employ/to apply/to make use of

502 rows × 3 columns
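The exclusion step boils down to Index.difference plus a sort by rank; a minimal sketch on toy data:

```python
import pandas as pd

# Frequency table indexed by word, plus a set of already-known words
freq = pd.DataFrame({'rank': [1, 2, 3]}, index=[u'的', u'我', u'你'])
known = pd.Index([u'我'])

# Keep only the frequent words not yet known, most frequent first
new_words = freq.loc[freq.index.difference(known)].sort_values('rank')
```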

You can download the most frequent 1000 words with definitions here.

What's Next

There are a few things missing here:

  1. While useful, the corpus is too broad; I still want to take the corpus of the specific TV show I'm watching and analyze it.
  2. The SUBTLEX work counts the frequency of each word by its syntactic role, which I totally ignored here.