Transformersライブラリで日本語BERTを動かす

Transformersライブラリを使用すると、簡単に学習済み日本語BERTモデル(東北大)が利用できるようなので、備忘録として投稿します。以前、セグメンテーションモデル(HRNetV2)のファインチューニングはしたことがあるのですが、テキストモデルは初めてです。

Transformers公式ドキュメントはこちら

BertJapanese

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

環境構築
動かしてみた
まとめ

環境構築

使用する環境は、以下の通りです。今回は動作確認のみで、GUPは使わずCPU上で動かします。
・windows10
・miniforge
・CUDA11.6

# 仮想環境作成、起動
$ conda create -n pytorch112 python
$ conda activate pytorch112

# torchをインストール、コマンドはpytorchの公式ドキュメントより
$ pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

# transformersをインストール、[ja]を指定することで日本語用Tokenizerのfugashiがインストールされる
$ pip install transformers[ja]

# モデル構造の確認用にtorchinfoもインストール
$ pip install torchinfo

transformersライブラリと一緒にTokenizerもインストールできるようで非常にお手軽ですね。fugashiはMeCabのラッパーで、辞書はUniDicがインストールされます。

東北大BERTはいくつか公開されているようです。

・cl-tohoku/bert-large-japanese
BERT large model(24 layers, 1024 dimensions of hidden states, 16 attension heads)
学習データ：Wikipedia 4GB, 30M sentences
MeCab(Unidic 2.1.2)で形態素解析の後、サブワードに分割(vocabulary 32768)
トレーニング：whole word masking

・cl-tohoku/bert-large-japanese-char
BERT large model(24 layers, 1024 dimensions of hidden states, 16 attension heads)
学習データ：Wikipedia 4GB, 30M sentences
MeCab(Unidic 2.1.2)で形態素解析の後、１文字単位に分割(vocabulary 6144)
トレーニング：whole word masking

・cl-tohoku/bert-base-japanese-v2
BERT base model(12 layers, 768 dimensions of hidden states, 12 attension heads)
学習データ：Wikipedia 4GB, 30M sentences
MeCab(Unidic 2.1.2)で形態素解析の後、サブワードに分割(vocabulary 32768)
トレーニング：whole word masking

・cl-tohoku/bert-base-japanese-char-v2
BERT base model(12 layers, 768 dimensions of hidden states, 12 attension heads)
学習データ：Wikipedia 4GB, 30M sentences
MeCab(Unidic 2.1.2)で形態素解析の後、１文字単位に分割(vocabulary 6144)
トレーニング：whole word masking

動かしてみた

それでは実際に、jupyter notebook上で動かしてみます。

学習済みモデル（東北大BERT）の読み込み。

from transformers import BertModel
from torchinfo import summary

bert = BertModel.from_pretrained('cl-tohoku/bert-base-japanese-v2')
summary(bert)

===========================================================================
Layer (type:depth-idx)                             Param #
===========================================================================
BertModel                                          --
├─BertEmbeddings: 1-1                              --
│    └─Embedding: 2-1                              25,165,824
│    └─Embedding: 2-2                              393,216
│    └─Embedding: 2-3                              1,536
│    └─LayerNorm: 2-4                              1,536
│    └─Dropout: 2-5                                --
├─BertEncoder: 1-2                                 --
│    └─ModuleList: 2-6                             --
│    │    └─BertLayer: 3-1                         7,087,872
│    │    └─BertLayer: 3-2                         7,087,872
│    │    └─BertLayer: 3-3                         7,087,872
│    │    └─BertLayer: 3-4                         7,087,872
│    │    └─BertLayer: 3-5                         7,087,872
│    │    └─BertLayer: 3-6                         7,087,872
│    │    └─BertLayer: 3-7                         7,087,872
│    │    └─BertLayer: 3-8                         7,087,872
│    │    └─BertLayer: 3-9                         7,087,872
│    │    └─BertLayer: 3-10                        7,087,872
│    │    └─BertLayer: 3-11                        7,087,872
│    │    └─BertLayer: 3-12                        7,087,872
├─BertPooler: 1-3                                  --
│    └─Linear: 2-7                                 590,592
│    └─Tanh: 2-8                                   --
===========================================================================
Total params: 111,207,168
Trainable params: 111,207,168
Non-trainable params: 0
===========================================================================

Tokenizerを用意します。

from transformers import BertJapaneseTokenizer
tknz = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-v2')

分かち書き、数値化を試します。引数のadd_special_tokensで[CLS]と[SEP]を含めるかを選べるようです。

print(tknz.tokenize("今朝はご飯と味噌汁を食べた。"))
print(tknz.encode("今朝はご飯と味噌汁を食べた。"))
print(tknz.encode("今朝はご飯と味噌汁を食べた。", add_special_tokens=False))

['今', '##朝', 'は', 'ご飯', 'と', '味噌', '汁', 'を', '食べ', 'た', '。']
[2, 1109, 7176, 897, 31893, 890, 22855, 3146, 932, 12409, 881, 829, 3]
[1109, 7176, 897, 31893, 890, 22855, 3146, 932, 12409, 881, 829]

取り合えず、モデルに文章を突っ込むためにtorchのtensor形式にします。

import torch
input_data = tknz.encode("今朝はご飯と味噌汁を食べた")
input_data = torch.LongTensor(input_data).unsqueeze(0)
print(input_data)

tensor([[    2,  1109,  7176,   897, 31893,   890, 22855,  3146,   932, 12409,
           881,   829,     3]])

そしてモデルに突っ込みます。

output = bert(input_data)
print(output.last_hidden_state.shape)
print(output.pooler_output.shape)

torch.Size([1, 13, 768])
torch.Size([1, 768])

出力は768次元ベクトルで、隠れ層の最終層は13×768テンソルになっています。13は分かち書き後の単語数と同じです。

これでとりあえず動かせることが分かりました。深層学習系のモデルは動かすまでが大変だったりするのですが、このライブラリはかなり手軽に使用できるようです。

まとめ

Transformersライブラリで日本語BERTを動かしました。今回は、文書データをただモデルに突っ込んで出力を確認しただけなので、次はクラス分類タスクでファインチューニングをさせてみようと思います。