EduNLP.Pretrain

class EduNLP.Pretrain.GensimSegTokenizer(symbol='gms', depth=None, flatten=False, **kwargs)
Parameters
  • symbol (str) –

    Selects which element types are replaced by placeholder symbols:

    "t": text, "f": formula, "g": figure, "m": question mark, "a": tag, "s": sep

    e.g. gms, fgm

  • depth (int or None) – 0: separate only at SIFSep; 1: separate only at SIFTag; 2: separate at both SIFTag and SIFSep; otherwise, separate all segments.
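
The depth rule above can be illustrated with a small standalone sketch. It does not call EduNLP: the `(kind, content)` segment tuples, the kind labels, and the helper names `split_points` and `group_segments` are all hypothetical, and dropping the boundary segment from the output is an assumption made only for this illustration.

```python
from typing import List, Optional, Set, Tuple

Segment = Tuple[str, str]  # (kind, content); the kind labels are illustrative


def split_points(depth: Optional[int]) -> Optional[Set[str]]:
    """Return the segment kinds that close a group, or None to split everywhere."""
    if depth == 0:
        return {"sep"}          # separate only at SIFSep
    if depth == 1:
        return {"tag"}          # separate only at SIFTag
    if depth == 2:
        return {"tag", "sep"}   # separate at both
    return None                 # any other value, including None


def group_segments(segments: List[Segment], depth: Optional[int] = None) -> List[List[str]]:
    """Group segment contents according to the depth rule (hypothetical helper)."""
    boundaries = split_points(depth)
    if boundaries is None:
        # Every segment becomes its own group.
        return [[content] for _, content in segments]
    groups: List[List[str]] = []
    current: List[str] = []
    for kind, content in segments:
        if kind in boundaries:
            # Assumption: a boundary closes the current group and is not kept.
            if current:
                groups.append(current)
            current = []
        else:
            current.append(content)
    if current:
        groups.append(current)
    return groups
```

With `depth=None` each segment ends up in its own list, which matches the shape of the nested lists printed in the examples.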

Returns

tokenizer

Return type

Tokenizer

Examples

>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{wrong1?}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> tokenizer = GensimSegTokenizer(symbol="fgm", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], ['[FORMULA]'], ['如图'], ['[FIGURE]'], ['[FORMULA]'],...['[FORMULA]'], ['最大值'], ['[MARK]']]
class EduNLP.Pretrain.GensimWordTokenizer(symbol='gm', general=False)
Parameters
  • symbol (str) –

    Selects which element types are replaced by placeholder symbols:

    "t": text, "f": formula, "g": figure, "m": question mark, "a": tag, "s": sep

    e.g.: gm, fgm, gmas, fgmas

  • general (bool) –

    True: use when the item is not in standard format; formulas (except formulas in figures) are tokenized linearly.

    False: formulas are tokenized with the 'ast' method instead of 'linear'.
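
As a rough illustration of the symbol parameter, the sketch below maps each letter to a placeholder and applies it to pre-labelled segments. It is not EduNLP's implementation: the `symbolize` helper and the `(kind, content)` tuples are hypothetical. The placeholders "[FORMULA]", "[FIGURE]", "[MARK]" and "[SEP]" appear in the documented outputs, while "[TEXT]" and "[TAG]" are assumed by analogy.

```python
from typing import Dict, List, Tuple

# letter -> (segment kind, placeholder). "[TEXT]" and "[TAG]" are assumptions;
# the other placeholders are taken from the documented example outputs.
PLACEHOLDERS: Dict[str, Tuple[str, str]] = {
    "t": ("text", "[TEXT]"),
    "f": ("formula", "[FORMULA]"),
    "g": ("figure", "[FIGURE]"),
    "m": ("mark", "[MARK]"),
    "a": ("tag", "[TAG]"),
    "s": ("sep", "[SEP]"),
}


def symbolize(segments: List[Tuple[str, str]], symbol: str) -> List[str]:
    """Replace the content of every selected segment kind with its placeholder."""
    selected = {PLACEHOLDERS[letter][0]: PLACEHOLDERS[letter][1] for letter in symbol}
    # Kinds that were not selected keep their original content.
    return [selected.get(kind, content) for kind, content in segments]
```

Adding "f" to the symbol string is what turns a formula such as `x,y` into a single "[FORMULA]" token instead of leaving it to be tokenized.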

Returns

tokenizer

Return type

Tokenizer

Examples

>>> tokenizer = GensimWordTokenizer(symbol="gmas", general=True)
>>> token_item = tokenizer("有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', 'x', ',', 'y', '约束条件', '公式', '[FORMULA]']
>>> tokenizer = GensimWordTokenizer(symbol="fgmas", general=False)
>>> token_item = tokenizer("有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item.tokens[:10])
['公式', '[FORMULA]', '如图', '[FIGURE]', '[FORMULA]', '约束条件', '公式', '[FORMULA]', '[SEP]', '[FORMULA]']
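
In the first example above (general=True, no "f" in symbol), the formula `x,y` shows up as the tokens 'x', ',', 'y'. A linear tokenization of that kind can be approximated with a regular expression, as in the sketch below. This is only an illustration under stated assumptions: `linear_tokens` is a hypothetical helper, not the tokenizer EduNLP uses, and the 'ast' method is not covered.

```python
import re
from typing import List

# One token per LaTeX command, number, or any other non-space character.
_TOKEN = re.compile(r"\\[A-Za-z]+|\d+(?:\.\d+)?|\S")


def linear_tokens(formula: str) -> List[str]:
    """Split a formula left to right into flat tokens (hypothetical helper)."""
    return _TOKEN.findall(formula)


print(linear_tokens("x,y"))      # ['x', ',', 'y']
print(linear_tokens("z=x+7 y"))  # ['z', '=', 'x', '+', '7', 'y']
```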
EduNLP.Pretrain.train_vector(items, w2v_prefix, embedding_dim=None, method='sg', binary=None, train_params=None)
Parameters
  • items (str) – the text of the question

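  • w2v_prefix – path prefix of the saved model file; judging by the example below, the method, the embedding dimension, and the format extension are appended to it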

  • embedding_dim (int) – vector_size

  • method (str) – the method of training, e.g.: sg, cbow, fasttext, d2v, bow, tfidf

  • binary (bool) – the format of the saved model. True: bin; False: kv

  • train_params (dict) – the training parameters passed to model

Returns

path of the saved model file

Return type

str

Examples

>>> tokenizer = GensimSegTokenizer(symbol="gms", depth=None)
>>> token_item = tokenizer("有公式$\FormFigureID{wrong1?}$,如图$\FigureID{088f15ea-xxx}$,若$x,y$满足约束条件公式$\FormFigureBase64{wrong2?}$,$\SIFSep$,则$z=x+7 y$的最大值为$\SIFBlank$")
>>> print(token_item[:10])
[['公式'], [\FormFigureID{wrong1?}], ['如图'], ['[FIGURE]'],...['最大值'], ['[MARK]']]
>>> train_vector(token_item[:10], "examples/test_model/data/gensim_luna_stem_t_", 100) 
'examples/test_model/data/gensim_luna_stem_t_sg_100.kv'
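
The path returned in the example is consistent with the prefix being extended by the method name, the embedding dimension, and a format extension. The sketch below reproduces that naming pattern. `model_path` is a hypothetical helper, not part of EduNLP, and the ".bin" extension for binary=True is inferred from the parameter description rather than from a documented output.

```python
def model_path(prefix: str, method: str, dim: int, binary: bool = False) -> str:
    """Compose an output path in the pattern seen in the example (hypothetical helper)."""
    # Assumption: binary=True maps to ".bin"; binary=False maps to ".kv".
    extension = "bin" if binary else "kv"
    return f"{prefix}{method}_{dim}.{extension}"


print(model_path("examples/test_model/data/gensim_luna_stem_t_", "sg", 100))
# examples/test_model/data/gensim_luna_stem_t_sg_100.kv
```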