Sunday, February 1, 2015

Download Japanese radicals

I wanted to create a database of Japanese kanji radicals and their readings and meanings. If you have no idea what I am talking about, this is the time to leave, because none of what follows is likely to be of any interest to you ;)

Now, Wikipedia has a nice table of Japanese radicals, but it is an HTML page, so the data is not directly usable. So, time for some Python-fu!

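The code below downloads the Wikipedia page, extracts the data using regular expressions, fixes some of the broken table entries, and writes a tab-separated file with one radical per line. The columns are the radical number, the radical (plus any alternate forms, comma-separated), the stroke count, the meaning, the reading, and the frequency: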

1 一 1 one いち 42
2 丨 1 line ぼう 21
3 丶 1 dot てん 10
4 丿 1 bend の 33
5 乙,乚 1 second,latter おつ 42
6 亅 1 hook はねぼう 19
7 二 2 two ふた 29
8 亠 2 lid なべぶた 38
...
213 龜 11 turtle,tortoise かめ 24
214 龠 17 flute やく 19

Note that this is quick and dirty code. It is not especially elegant or efficient, but it does the job. If you ever need to download some radical data, here it is:

  
import re
import urllib.request

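def load_page(url):
    # Wikipedia may reject urllib's default User-Agent, so send a descriptive one.
    # The string below is only a placeholder; replace it with your own.
    headers = {'User-Agent': 'radical-table-script/0.1'}
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    return response.read().decode('utf-8')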
    
def parse_rows():
    url = 'http://en.wikipedia.org/wiki/Table_of_Japanese_kanji_radicals'
    html = load_page(url)
    # One pattern fragment per extracted field; concatenated they match a full table row.
    r_radical = r'<a.*?title="wikt:Index:Chinese radical/.">(.+?)'
    r_alt = r'</a>(.*?)</span></td>\n'
    r_strokes = r'<td>(.*?)</td>\n'
    r_meaning = r'<td>(.*?)'
    r_reading = r'<span.+?xml:lang="ja">(.*?)</span>.*?</span></td>\n'
    r_freq = r'<td>(.*?)</td>'
    regex = re.compile(r_radical + r_alt + r_strokes + r_meaning + r_reading + r_freq)
    rows = regex.findall(html)
    for no, (radical, alt, strokes, meaning, reading, freq) in enumerate(rows, start=1):
        # Keep only the non-Latin characters (the alternate radical forms).
        # dict.fromkeys removes duplicates but, unlike set, keeps their order.
        alt = [ch for ch in dict.fromkeys(alt) if ord(ch) > 256]
        if no == 125:   # fix for bad format of radical 125
            alt = ['耂']
        radicals = ','.join([radical] + alt)
        meaning = meaning.replace(', ', ',').strip()
        if no == 80:   # fix for bad format of radical 80
            reading = 'なかれ;はは'
        if no == 132:   # fix for bad format of radical 132
            reading = meaning
            meaning = 'own'
        yield no, radicals, strokes, meaning, reading, freq
                
def write_rows(filepath):
    with open(filepath, mode='w', encoding='utf-8') as f:
        for row in parse_rows():      
            f.write("%d\t%s\t%s\t%s\t%s\t%s\n"%row)
        
if __name__ == "__main__":
    print('running...')
    write_rows("radicals.utf8")
    print('done.')
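
To see how the concatenated pattern in parse_rows() picks a table row apart, here is a small sketch that runs the same pattern fragments against a hand-written snippet. The HTML is hypothetical: it is shaped to fit the patterns, not copied from Wikipedia, and the real page may well look different by now. The names extract_rows and SAMPLE_HTML are made up for this illustration.

```python
import re

# The same six pattern fragments used in parse_rows().
ROW_REGEX = re.compile(
    r'<a.*?title="wikt:Index:Chinese radical/.">(.+?)'            # radical
    r'</a>(.*?)</span></td>\n'                                    # alternate forms
    r'<td>(.*?)</td>\n'                                           # stroke count
    r'<td>(.*?)'                                                  # meaning
    r'<span.+?xml:lang="ja">(.*?)</span>.*?</span></td>\n'        # reading
    r'<td>(.*?)</td>'                                             # frequency
)

# Hypothetical table rows, written by hand to match the patterns above.
SAMPLE_HTML = (
    '<tr>\n'
    '<td><span lang="ja"><a href="#" title="wikt:Index:Chinese radical/一">一</a></span></td>\n'
    '<td>1</td>\n'
    '<td>one <span class="nihongo"><span lang="ja" xml:lang="ja">いち</span> (ichi)</span></td>\n'
    '<td>42</td>\n'
    '</tr>\n'
    '<tr>\n'
    '<td><span lang="ja"><a href="#" title="wikt:Index:Chinese radical/乙">乙</a> (乚)</span></td>\n'
    '<td>1</td>\n'
    '<td>second, latter <span class="nihongo"><span lang="ja" xml:lang="ja">おつ</span> (otsu)</span></td>\n'
    '<td>42</td>\n'
    '</tr>\n'
)


def extract_rows(html):
    """Return one cleaned (radicals, strokes, meaning, reading, freq) tuple per row."""
    rows = []
    for radical, alt, strokes, meaning, reading, freq in ROW_REGEX.findall(html):
        # Drop brackets and spaces; keep only the non-Latin alternate forms.
        alt_forms = [ch for ch in dict.fromkeys(alt) if ord(ch) > 256]
        radicals = ','.join([radical] + alt_forms)
        meaning = meaning.replace(', ', ',').strip()
        rows.append((radicals, strokes, meaning, reading, freq))
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_HTML):
        print('\t'.join(row))
```

Because none of the patterns use re.DOTALL, the dot never crosses a line break, which is what keeps each lazy group inside its own table cell.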

Sorry for the mediocre formatting and the lack of syntax highlighting, but the Python plugin gets confused by regular expressions containing HTML code, so this code is hand-formatted. A nicer version of the code can be found in my learning blog.
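
Once radicals.utf8 has been written, reading it back is straightforward. The following is a minimal sketch, assuming the six-column tab-separated layout described above; the function name read_radicals and the dictionary keys are my own choices for illustration and are not part of the script.

```python
import csv


def read_radicals(filepath):
    """Read the tab-separated radical file into a list of dicts."""
    # Key names are assumptions chosen for this sketch.
    fields = ['number', 'radicals', 'strokes', 'meaning', 'reading', 'frequency']
    rows = []
    with open(filepath, encoding='utf-8', newline='') as f:
        reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        for record in reader:
            if not record:          # skip empty lines
                continue
            row = dict(zip(fields, record))
            row['number'] = int(row['number'])
            # Alternate forms and multiple meanings are comma-separated.
            row['radicals'] = row['radicals'].split(',')
            row['meaning'] = row['meaning'].split(',')
            rows.append(row)
    return rows


if __name__ == "__main__":
    for row in read_radicals("radicals.utf8"):
        print(row['number'], row['radicals'][0], row['meaning'])
```

The other columns are left as strings, since the script writes them exactly as they were captured from the page.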