Chinese Text Analyzer (in Python)

Now that I’ve finished writing mnemonics for all HSK characters, I started reading 活着 by 余华 and needed to figure out which characters I should learn (i.e. write mnemonics for) in order to enjoy the novel. So I wrote a simple Chinese Text Analyzer that counts how often each character appears and sorts the result by HSK level. Here it is:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import sys
import re

filename = sys.argv[1] # get the filename of the file to analyze

with open("hskchars.csv") as hskcharsfile:
    dictionary = {}
    for line in hskcharsfile:
        # each line is "<character> <HSK level>"
        hanzi, level = line.strip().split(" ", 1)
        dictionary[hanzi] = int(level)

with open(filename) as textfile:
    text = textfile.read()

not_chinese = re.sub('[\u4E00-\u9FFF]', '', text)  # keep only the non-Chinese characters
text = re.sub('[^\u4E00-\u9FFF]', '', text)        # keep only the Chinese characters

items = sorted(set(text))  # every distinct character, in code-point order
total_count = len(text)

results = []
for hanzi in items:
    counts = text.count(hanzi)
    percentage = counts / total_count * 100
    HSK = dictionary.get(hanzi, 7)  # 7 = not in any HSK list
    results.append([hanzi, counts, HSK, percentage])

results.sort(key=lambda row: (row[2], -row[3]))  # HSK level ascending, frequency descending

printed_results = []
for r in results:
    if r[2] == 7:
        r[2] = ""
    printed_results.append("{}\t{:6d}\t{}\t{:2.4f}".format(r[0], r[1], r[2], r[3]))

print("\n".join(printed_results))
print("# Discarded characters: {}".format(sorted(set(not_chinese))))
print("# Total count: {}".format(total_count))

The file “hskchars.csv” lives in the same directory and is a text file in which each line consists of 1. a Chinese character, 2. a space ” ”, and 3. the HSK level. You can download it here (no guarantee that it’s correct, though):
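For illustration, here is how a few such lines would be parsed into the lookup dictionary. The characters and levels below are made-up examples, not taken from the actual download:

```python
# A hypothetical excerpt of hskchars.csv ("<character> <HSK level>" per line)
sample = "爱 1\n报 2\n成 3"

dictionary = {}
for line in sample.splitlines():
    hanzi, level = line.strip().split(" ", 1)
    dictionary[hanzi] = int(level)

print(dictionary)  # → {'爱': 1, '报': 2, '成': 3}
```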

The script is called on the command line like this (assuming it is saved as chinese_text_analyzer.py):

python3 chinese_text_analyzer.py ~/Audiobooks/Huozhe/huozhe.txt

In this example, huozhe.txt contains the text of the book. As for speed, for this novel the script completes instantly (~0.2 s).
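A note on that runtime: the per-character `text.count` calls rescan the whole text once for every distinct character. For a single novel that is fast enough, but for larger corpora a single pass with `collections.Counter` keeps the counting linear. A minimal sketch, with tiny stand-in values for the cleaned text and the HSK dictionary:

```python
from collections import Counter

text = "活着活着人"      # stand-in for the cleaned book text
dictionary = {"人": 1}   # stand-in for the HSK lookup table

counts = Counter(text)   # one pass over the whole text
total_count = len(text)

results = [
    [hanzi, n, dictionary.get(hanzi, 7), n / total_count * 100]
    for hanzi, n in counts.items()
]
results.sort(key=lambda row: (row[2], -row[3]))  # HSK ascending, frequency descending

print(results)  # → [['人', 1, 1, 20.0], ['活', 2, 7, 40.0], ['着', 2, 7, 40.0]]
```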
