MBTI Personality Test

class components.mbti.MBTI

The MBTI object contains the functions used by the MBTI Personality Test tab.

Methods

clean_text(text[, lemma])

Process text

get_bar_plot(predictions)

Get figure for plot

get_feature_importance([max_num_features])

Print top feature importance for each model

get_num_words(input_text)

Get number of input words in vocabulary

get_personality_details(personality)

Get personality details (summary and details)

get_train_test(X, y[, test_size, random_state])

Splits data into training and testing data

load_and_save_data()

Reads in data, performs preprocessing and saves data

load_model(path_model)

Load and return saved best model after grid search with stratified cross validation

load_model_tf(path_model)

Load and return saved tensorflow model

load_tokenizer()

Load and return saved tokenizer

load_vectorizer()

Load and return saved vectorizer

predict_model(model, vector_test)

Perform prediction on test set

predict_model_tf(model, vector_test)

Perform prediction on test set

save_model(vector_train, y_train_series, …)

Train, save and return best model after grid search with stratified cross validation

save_model_tf(vector_train, y_train_series, …)

Train, save and return tensorflow model

save_tokenizer(corpus[, params])

Fit, save and return tokenizer

save_vectorizer(corpus[, params])

Fit, save and return vectorizer

test_pipeline(input_text)

Testing pipeline for new input text

tokenize_new_input(input_text)

Load saved tokenizer and transform input text

train_pipeline([train_vect, train_model])

Training pipeline for loading, preprocessing and model training

transform_tokenizer(tokenizer, corpus)

Transform corpus with tokenizer

transform_vectorizer(vect, corpus)

Transform corpus with vectorizer

vectorize_new_input(input_text)

Load saved vectorizer and transform input text

static clean_text(text, lemma=<WordNetLemmatizer>)

Process text

  1. Split different sentences

  2. Make words lowercase

  3. Remove URLs (e.g. http links) and usernames (e.g. @username)

  4. Remove digits and punctuation

  5. Remove any mention of MBTI types

  6. Tokenize words (i.e. split the words into list)

  7. Lemmatize words (i.e. reduce words to their base form, for example plural to singular)

  8. Join text into string

Parameters
  • text (str) – input text

  • lemma – Lemmatizer (defaults to nltk WordNetLemmatizer)

Returns

processed text

Return type

(str)
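
The eight steps above can be sketched with the standard library alone. This is a minimal illustration, not the actual implementation: the "|||" separator, the regular expressions, and the `lemmatize` callable (a stand-in for the nltk WordNetLemmatizer) are all assumptions.

```python
import re

# Assumption: "mention of MBTI types" means the sixteen four-letter type codes.
MBTI_TYPES = [a + b + c + d for a in "ie" for b in "ns" for c in "ft" for d in "jp"]
MBTI_PATTERN = re.compile(r"\b(" + "|".join(MBTI_TYPES) + r")s?\b")


def clean_text_sketch(text, lemmatize=lambda word: word, separator="|||"):
    """Illustrative cleaner; `lemmatize` is a hypothetical word -> word callable."""
    text = text.replace(separator, " ")                  # 1. split sentences (assumed separator)
    text = text.lower()                                  # 2. lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # 3. URLs
    text = re.sub(r"@\w+", " ", text)                    # 3. usernames
    text = re.sub(r"[^a-z\s]", " ", text)                # 4. digits and punctuation
    text = MBTI_PATTERN.sub(" ", text)                   # 5. MBTI type mentions
    tokens = text.split()                                # 6. tokenize on whitespace
    tokens = [lemmatize(token) for token in tokens]      # 7. lemmatize each token
    return " ".join(tokens)                              # 8. join back into a string


print(clean_text_sketch("I am an INFJ! Visit http://example.com @bob 123 times|||Cats are great"))
```

Passing the lemmatizer as a callable keeps the sketch runnable without nltk; a real lemmatizer could be plugged in through the same argument.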

static get_bar_plot(predictions)

Get figure for plot

Adds plotly.graph_objects charts for bar plot

Parameters

predictions (list) – list of model prediction probabilities

Returns

(dict)

get_feature_importance(max_num_features=10)

Print top feature importance for each model

Parameters

max_num_features (int) – number of top features to print for each model, defaults to 10

get_num_words(input_text)

Get number of input words in vocabulary

Parameters

input_text (str) – input text

Returns

(int)
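
As a rough illustration of counting input words against a vocabulary, the sketch below assumes the vocabulary is a word-to-index mapping (similar in shape to a fitted vectorizer's vocabulary) and that every occurrence is counted. Both choices are assumptions; the source does not say whether repeated words are counted once or each time.

```python
def count_words_in_vocabulary(input_text, vocabulary):
    """Count whitespace-separated tokens of `input_text` that appear in `vocabulary`."""
    return sum(1 for token in input_text.lower().split() if token in vocabulary)


# Hypothetical vocabulary mapping each known word to a column index.
vocabulary = {"cat": 0, "dog": 1, "happy": 2}
print(count_words_in_vocabulary("my happy cat chased another cat", vocabulary))
```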

static get_personality_details(personality)

Get personality details (summary and details)

Parameters

personality (str) – MBTI personality result for which to retrieve details

Returns

(list)

static get_train_test(X, y, test_size=0.2, random_state=0)

Splits data into training and testing data

Parameters
  • X (pd.DataFrame) – processed input data

  • y (pd.DataFrame) – processed output data

  • test_size (float) – proportion of test data, defaults to 0.2

  • random_state (int) – fixed seed, allows reproducible result, defaults to 0

Returns

4-element tuple

  • X_train (pd.DataFrame): training input

  • X_test (pd.DataFrame): testing input

  • y_train (pd.DataFrame): training output

  • y_test (pd.DataFrame): testing output
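
The effect of `test_size` and `random_state` can be shown with a small standard-library stand-in. It works on plain lists rather than DataFrames and is not the method's actual implementation; the function name is hypothetical. It returns the four pieces in the same order as listed above.

```python
import math
import random


def train_test_split_sketch(X, y, test_size=0.2, random_state=0):
    """Shuffle row indices with a fixed seed, then cut off a test portion."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    indices = list(range(len(X)))
    random.Random(random_state).shuffle(indices)   # same seed -> same shuffle
    n_test = math.ceil(len(indices) * test_size)   # number of rows held out
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


X = [f"post {i}" for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = train_test_split_sketch(X, y)
print(len(X_train), len(X_test))
```

Because inputs and outputs are indexed with the same shuffled indices, each input stays paired with its label after the split.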

load_and_save_data()

Reads in data, performs preprocessing and saves data. If saved data is present, reads in the saved data directly instead.

If saved data does not exist
  1. Reads in data

  2. Inserts new indicator columns, one for each MBTI category

  3. Processes the text column

  4. Saves data

If saved data exists
  1. Reads in saved data

Returns

processed data

Return type

(pd.DataFrame)
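
The load-or-build pattern described above might look like the following sketch, written with the csv module instead of pandas. The column names "type" and "posts", the indicator column names, and the `clean` callable are all assumptions made for illustration.

```python
import csv
from pathlib import Path

# Hypothetical indicator columns: name -> (position in the type code, letter).
INDICATORS = {"is_I": (0, "I"), "is_N": (1, "N"), "is_F": (2, "F"), "is_J": (3, "J")}


def load_and_save_sketch(raw_path, saved_path, clean=str.lower):
    """Build the processed CSV once, then always read it back from disk."""
    saved_path = Path(saved_path)
    if not saved_path.exists():
        with Path(raw_path).open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            fieldnames = list(reader.fieldnames) + list(INDICATORS)
            rows = []
            for row in reader:
                # One 0/1 indicator column per MBTI category.
                for name, (position, letter) in INDICATORS.items():
                    row[name] = int(row["type"][position] == letter)
                row["posts"] = clean(row["posts"])   # process the text column
                rows.append(row)
        with saved_path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
    # Reading from the saved file in both cases keeps the return value
    # identical on the first and later calls (all values come back as strings).
    with saved_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Once the saved file exists, the raw file is no longer touched, so the expensive preprocessing runs only on the first call.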

static load_model(path_model)

Load and return saved best model after grid search with stratified cross validation

Parameters

path_model (str) – location and file name of saved model

Returns

model

static load_model_tf(path_model)

Load and return saved tensorflow model

Parameters

path_model (str) – location and file name of saved model

Returns

model

load_tokenizer()

Load and return saved tokenizer

Returns

keras Tokenizer

load_vectorizer()

Load and return saved vectorizer

Returns

(sklearn.CountVectorizer)

static predict_model(model, vector_test)

Perform prediction on test set

Parameters
  • model (model) – model to be used for prediction

  • vector_test (scipy.csr_matrix) – vectorized testing input

Returns

y_pred (np.ndarray)

predict_model_tf(model, vector_test)

Perform prediction on test set

Parameters
  • model (model) – model to be used for prediction

  • vector_test (np.ndarray) – vectorized testing input

Returns

y_pred (np.ndarray)

static save_model(vector_train, y_train_series, path_model)

Train, save and return best model after grid search with stratified cross validation

Parameters
  • vector_train (scipy.csr_matrix) – vectorized training input

  • y_train_series (pd.Series) – training output, one-column subset of y_train

  • path_model (str) – location and file name of saved model

Returns

model
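
To make "stratified cross validation" concrete, the sketch below assigns row indices to folds so that each fold keeps roughly the same class proportions as the full label series. It is a standard-library illustration only; the source does not show which library call performs the grid search or the fold assignment, and the function name is hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits=5):
    """Deal the indices of each class round-robin across `n_splits` folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


labels = [0] * 6 + [1] * 4          # imbalanced toy labels (60% / 40%)
for fold in stratified_folds(labels, n_splits=2):
    print(fold, [labels[i] for i in fold])
```

In a grid search, each candidate parameter setting would be scored by holding out one fold at a time and averaging the results, and the best-scoring setting would be refit and saved.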

save_model_tf(vector_train, y_train_series, path_model)

Train, save and return tensorflow model

Parameters
  • vector_train (np.ndarray) – vectorized training input

  • y_train_series (pd.Series) – training output, one-column subset of y_train

  • path_model (str) – location and file name of saved model

Returns

model

save_tokenizer(corpus, params=None)

Fit, save and return tokenizer

Parameters
  • corpus (pd.Series) – input text corpus (training input)

  • params (dict) – specifies parameters for tokenizer, defaults to None

Returns

(keras.Tokenizer)

save_vectorizer(corpus, params=None)

Fit, save and return vectorizer

Parameters
  • corpus (pd.Series) – input text corpus (training input)

  • params (dict) – specifies parameters for vectorizer, defaults to None

Returns

(sklearn.CountVectorizer)

test_pipeline(input_text)

Testing pipeline for new input text

Parameters

input_text (str) – input text

Returns

2-element tuple

  • personality (str): MBTI personality results, to be shown in title of bar plot

  • predictions (list): list of tuples of model prediction probabilities
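
One way the two return values could relate is sketched below. It assumes, purely for illustration, that `predictions` holds one (probability of first letter, probability of second letter) tuple per MBTI category, in the order I/E, N/S, F/T, J/P; the source does not specify this layout, and the function name is hypothetical.

```python
# Assumed category order and letter pairs.
AXES = [("I", "E"), ("N", "S"), ("F", "T"), ("J", "P")]


def combine_predictions(predictions):
    """Pick the more probable letter on each axis and join them into a type code."""
    letters = []
    for (first, second), (p_first, p_second) in zip(AXES, predictions):
        letters.append(first if p_first >= p_second else second)
    return "".join(letters)


predictions = [(0.7, 0.3), (0.4, 0.6), (0.55, 0.45), (0.2, 0.8)]
print(combine_predictions(predictions))
```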

tokenize_new_input(input_text)

Load saved tokenizer and transform input text

Parameters

input_text (str) – input text

Returns

tokenized input_text

Return type

(np.ndarray)

train_pipeline(train_vect=False, train_model=False)

Training pipeline for loading, preprocessing and model training

Parameters
  • train_vect (bool) – indicates whether to retrain vectorizer, defaults to False

  • train_model (bool) – indicates whether to retrain models, defaults to False

Returns

None

transform_tokenizer(tokenizer, corpus)

Transform corpus with tokenizer

Parameters
  • tokenizer (keras Tokenizer) – tokenizer to be used to transform text corpus

  • corpus (pd.Series) – input text corpus

Returns

tokenized text corpus

Return type

vector_corpus (np.ndarray)

transform_vectorizer(vect, corpus)

Transform corpus with vectorizer

Parameters
  • vect (sklearn.CountVectorizer) – vectorizer to be used to transform text corpus

  • corpus (pd.Series) – input text corpus

Returns

vectorized text corpus

Return type

vector_corpus (scipy.csr_matrix)

vectorize_new_input(input_text)

Load saved vectorizer and transform input text

Parameters

input_text (str) – input text

Returns

vectorized input_text

Return type

(scipy.csr_matrix)