Topic-Noise Models

gdtm.models.tnd module

class gdtm.models.tnd.TND(dataset=None, k=30, alpha=50, beta0=0.01, beta1=25, noise_words_max=200, iterations=1000, top_words=20, topic_word_distribution=None, noise_distribution=None, corpus=None, dictionary=None, mallet_path=None, random_seed=1824, run=True, workers=4)

Bases: object

Topic-Noise Discriminator (TND). The original topic-noise model, TND is best used in an ensemble with another model, such as LDA (as in NLDA) or the Guided Topic Model (GTM).

Parameters
  • dataset – list of lists, required.

  • k – int, optional: Number of topics to compute in TND.

  • alpha – int, optional: Alpha parameter of TND.

  • beta0 – float, optional: Beta_0 parameter of TND.

  • beta1 – int, optional: Beta_1 (skew) parameter of TND.

  • noise_words_max – int, optional: Number of noise words to keep when saving the distribution to a file. The noise_words_max most probable noise words will be saved.

  • iterations – int, optional: Number of training iterations for TND.

  • top_words – int, optional: Number of words per topic to return.

  • topic_word_distribution – dict, optional: Pre-trained topic-word distribution.

  • noise_distribution – dict, optional: Pre-trained noise distribution.

  • corpus – Gensim object, optional: Formatted documents for use in model. Automatically computed if not provided.

  • dictionary – Gensim object, optional: Formatted word mapping for use in model. Automatically computed if not provided.

  • mallet_path – filepath, required: Path to the Mallet TND executable, in the form path/to/mallet-tnd/bin/mallet.

  • random_seed – int, optional: Seed for processes that use random number generation.

  • run – bool, optional: If True, run the model on initialization, provided that data is supplied.

  • workers – int, optional: Number of cores to use for computation of TND.
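As a rough illustration of the inputs described above, the following sketch prepares a dataset as a list of lists and collects keyword arguments that mirror the documented defaults. The whitespace tokenizer, the sample documents, and the assumption that each inner list holds the tokens of one document are illustrative only. The constructor call is left as a comment because it needs gdtm and a local Mallet TND installation.

```python
# Toy documents; a real corpus would be loaded from disk.
raw_documents = [
    "great game last night",
    "lol omg what a game",
    "new phone battery is great",
]

# Assumption: each inner list holds the tokens of one document.
dataset = [doc.lower().split() for doc in raw_documents]

# Keyword arguments mirroring the defaults in the signature above.
tnd_params = {
    "k": 30,
    "alpha": 50,
    "beta0": 0.01,
    "beta1": 25,
    "noise_words_max": 200,
    "iterations": 1000,
    "top_words": 20,
    "random_seed": 1824,
    "workers": 4,
}

# With gdtm and Mallet TND installed, the model could then be built as:
# from gdtm.models.tnd import TND
# model = TND(dataset=dataset,
#             mallet_path="path/to/mallet-tnd/bin/mallet",
#             **tnd_params)
```

Keeping the parameters in a dictionary makes it easy to reuse the same settings across runs with different seeds or datasets.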

get_noise_distribution(tnd_noise_words_max=None)

Takes self.noise_distribution and tnd_noise_words_max and returns a list of (noise word, frequency) tuples ranked by frequency.

Parameters

tnd_noise_words_max – number of words to be returned

Returns

list of (noise word, frequency) tuples
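The ranking described above can be sketched in plain Python. This is not the library's implementation: the helper name top_noise_words is hypothetical, and the sketch assumes the noise distribution is a dict mapping each word to its frequency and that None means "no limit".

```python
def top_noise_words(noise_distribution, noise_words_max=None):
    """Return (word, frequency) tuples sorted by descending frequency."""
    ranked = sorted(
        noise_distribution.items(),
        key=lambda item: item[1],
        reverse=True,
    )
    # Assumption: None means "return everything"; the library may
    # instead fall back to a value stored on the model.
    if noise_words_max is None:
        return ranked
    return ranked[:noise_words_max]


# Toy noise distribution (word -> frequency).
noise = {"lol": 40, "omg": 25, "rt": 90, "game": 3}
top_two = top_noise_words(noise, noise_words_max=2)
print(top_two)  # [('rt', 90), ('lol', 40)]
```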

get_topics(top_words=None)

Takes top_words and self.topics and returns a list of topics, each a list of top_words words.

Parameters

top_words – number of words per topic

Returns

list of topic lists
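A minimal sketch of the return shape, assuming each topic is stored as a list of words ordered from most to least probable. The helper name truncate_topics is hypothetical and stands in for whatever the method does internally.

```python
def truncate_topics(topics, top_words=None):
    """Return a copy of each topic cut down to its first top_words words."""
    if top_words is None:
        # Assumption: None leaves every topic at full length.
        return [list(topic) for topic in topics]
    return [list(topic[:top_words]) for topic in topics]


# Toy topics, already ordered by word probability.
topics = [
    ["game", "team", "score", "win"],
    ["phone", "battery", "screen", "app"],
]
short_topics = truncate_topics(topics, top_words=2)
print(short_topics)  # [['game', 'team'], ['phone', 'battery']]
```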

gdtm.models.nlda module

class gdtm.models.nlda.NLDA(dataset=None, tnd_k=30, tnd_alpha=50, tnd_beta0=0.01, tnd_beta1=25, tnd_noise_words_max=200, tnd_iterations=1000, lda_iterations=1000, lda_k=30, phi=10, topic_depth=100, top_words=20, tnd_noise_distribution=None, lda_tw_dist=None, lda_topics=None, corpus=None, dictionary=None, mallet_tnd_path=None, mallet_lda_path=None, random_seed=1824, run=True, tnd_workers=4, lda_workers=4)

Bases: object

Noiseless Latent Dirichlet Allocation (NLDA). An ensemble topic-noise model consisting of the noise distribution from TND and the topic-word distribution from LDA. Input the raw data and compute the whole model, or input pre-computed distributions for faster inference.

Parameters
  • dataset – list of lists, required.

  • tnd_k – int, optional: Number of topics to compute in TND.

  • tnd_alpha – int, optional: Alpha parameter of TND.

  • tnd_beta0 – float, optional: Beta_0 parameter of TND.

  • tnd_beta1 – int, optional: Beta_1 (skew) parameter of TND.

  • tnd_noise_words_max – int, optional: Number of noise words to keep when saving the distribution to a file. The tnd_noise_words_max most probable noise words will be saved.

  • tnd_iterations – int, optional: Number of training iterations for TND.

  • lda_iterations – int, optional: Number of training iterations for LDA.

  • lda_k – int, optional: Number of topics to compute in LDA.

  • phi – int, optional: Topic weighting for noise filtering step.

  • topic_depth – int, optional: Number of most probable words per topic to consider for replacement in noise filtering step.

  • top_words – int, optional: Number of words per topic to return.

  • tnd_noise_distribution – dict, optional: Pre-trained noise distribution.

  • lda_tw_dist – dict, optional: Pre-trained topic-word distribution.

  • lda_topics – list of lists, optional: Pre-computed LDA topics.

  • corpus – Gensim object, optional: Formatted documents for use in model. Automatically computed if not provided.

  • dictionary – Gensim object, optional: Formatted word mapping for use in model. Automatically computed if not provided.

  • mallet_tnd_path – filepath, required: Path to the Mallet TND executable, in the form path/to/mallet-tnd/bin/mallet.

  • mallet_lda_path – filepath, required: Path to the Mallet LDA executable, in the form path/to/mallet-lda/bin/mallet.

  • random_seed – int, optional: Seed for processes that use random number generation.

  • run – bool, optional: If True, run the model on initialization, provided that data is supplied.

  • tnd_workers – int, optional: Number of cores to use for computation of TND.

  • lda_workers – int, optional: Number of cores to use for computation of LDA.
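To give a feel for how phi, topic_depth, and top_words could interact in the noise filtering step, here is a deliberately simplified sketch. The comparison rule (keep a word when its topic count scaled by phi / k is at least its noise count) is an assumption made for illustration and is not taken from the library, which does not document its rule here. The function name filter_topic and the toy counts are hypothetical.

```python
def filter_topic(topic_words, topic_counts, noise_counts,
                 phi=10, k=30, topic_depth=100, top_words=20):
    """Keep up to top_words words from the first topic_depth candidates.

    Hypothetical rule: a word survives when its topic count, scaled
    by phi / k, is at least its count in the noise distribution.
    """
    weight = phi / k
    kept = []
    for word in topic_words[:topic_depth]:
        if len(kept) == top_words:
            break
        topic_score = topic_counts.get(word, 0) * weight
        noise_score = noise_counts.get(word, 0)
        if topic_score >= noise_score:
            kept.append(word)
    return kept


# Toy LDA topic (ordered by probability) and toy count tables.
topic_words = ["lol", "game", "rt", "team", "score"]
topic_counts = {"lol": 30, "game": 60, "rt": 45, "team": 30, "score": 15}
noise_counts = {"lol": 40, "rt": 90, "game": 3}

filtered = filter_topic(topic_words, topic_counts, noise_counts,
                        phi=10, k=30, topic_depth=4, top_words=2)
print(filtered)  # ['game', 'team']
```

In this sketch a larger topic_depth gives the filter more replacement candidates to draw on when high-ranked words are rejected as noise, while a larger phi makes it harder for a word to be rejected.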

get_noise_distribution(tnd_noise_words_max=None)

Takes self.tnd_noise_distribution and tnd_noise_words_max and returns a list of (noise word, frequency) tuples ranked by frequency.

Parameters

tnd_noise_words_max – number of words to be returned

Returns

list of (noise word, frequency) tuples

get_topics(top_words=None)

Takes top_words and self.topics and returns a list of topics, each a list of top_words words.

Parameters

top_words – number of words per topic

Returns

list of topic lists

Embedded Topic-Noise Models

gdtm.models.etnd module

class gdtm.models.etnd.eTND(dataset=None, k=30, alpha=50, beta0=0.01, beta1=25, noise_words_max=200, iterations=1000, top_words=20, topic_word_distribution=None, noise_distribution=None, corpus=None, dictionary=None, mallet_path=None, embedding_path=None, closest_x_words=3, random_seed=1824, run=True, workers=4)

Bases: gdtm.models.tnd.TND

Embedded Topic-Noise Discriminator (eTND). The embedded version of TND, eTND is best used in an ensemble with another model, such as LDA (as in NLDA) or the Guided Topic Model (GTM).

Parameters
  • dataset – list of lists, required.

  • k – int, optional: Number of topics to compute in TND.

  • alpha – int, optional: Alpha parameter of TND.

  • beta0 – float, optional: Beta_0 parameter of TND.

  • beta1 – int, optional: Beta_1 (skew) parameter of TND.

  • noise_words_max – int, optional: Number of noise words to keep when saving the distribution to a file. The noise_words_max most probable noise words will be saved.

  • embedding_path – filepath, required: Path to trained word embedding vectors.

  • closest_x_words – int, optional: The number of words to sample from the word embedding space each time a word is determined to be a noise word.

  • iterations – int, optional: Number of training iterations for TND.

  • top_words – int, optional: Number of words per topic to return.

  • topic_word_distribution – dict, optional: Pre-trained topic-word distribution.

  • noise_distribution – dict, optional: Pre-trained noise distribution.

  • corpus – Gensim object, optional: Formatted documents for use in model. Automatically computed if not provided.

  • dictionary – Gensim object, optional: Formatted word mapping for use in model. Automatically computed if not provided.

  • mallet_path – filepath, required: Path to the Mallet TND executable, in the form path/to/mallet-tnd/bin/mallet.

  • random_seed – int, optional: Seed for processes that use random number generation.

  • run – bool, optional: If True, run the model on initialization, provided that data is supplied.

  • workers – int, optional: Number of cores to use for computation of TND.
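The role of closest_x_words can be illustrated with a small nearest-neighbour lookup. This sketch assumes that "closest" means highest cosine similarity and uses a toy in-memory embedding table. The function names and vectors are hypothetical, and the library's actual lookup (which reads vectors from embedding_path) may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_words(word, embeddings, closest_x_words=3):
    """Return the closest_x_words words most similar to `word`."""
    target = embeddings[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:closest_x_words]]


# Toy 3-dimensional embeddings; real vectors would be much larger.
embeddings = {
    "lol": [0.9, 0.1, 0.0],
    "lmao": [0.8, 0.2, 0.0],
    "haha": [0.7, 0.3, 0.1],
    "game": [0.0, 0.9, 0.4],
    "team": [0.1, 0.8, 0.5],
}

neighbours = closest_words("lol", embeddings, closest_x_words=2)
print(neighbours)  # ['lmao', 'haha']
```

Under this reading, each time a word such as "lol" is judged to be noise, its nearest neighbours in the embedding space are also treated as noise candidates, so a larger closest_x_words spreads the noise label more widely.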

gdtm.models.enlda module

class gdtm.models.enlda.eNLDA(dataset=None, tnd_k=30, tnd_alpha=50, tnd_beta0=0.01, tnd_beta1=25, tnd_noise_words_max=200, tnd_iterations=1000, lda_iterations=1000, lda_k=30, phi=10, topic_depth=100, top_words=20, tnd_noise_distribution=None, lda_tw_dist=None, lda_topics=None, corpus=None, dictionary=None, mallet_tnd_path=None, mallet_lda_path=None, embedding_path=None, closest_x_words=3, random_seed=1824, run=True, tnd_workers=4, lda_workers=4)

Bases: gdtm.models.nlda.NLDA

Embedded Noiseless Latent Dirichlet Allocation (eNLDA). An ensemble topic-noise model consisting of the noise distribution from TND and the topic-word distribution from LDA. Input the raw data and compute the whole model, or input pre-computed distributions for faster inference. Uses word embedding vectors to enhance the TND noise distribution.

Parameters
  • dataset – list of lists, required.

  • tnd_k – int, optional: Number of topics to compute in TND.

  • tnd_alpha – int, optional: Alpha parameter of TND.

  • tnd_beta0 – float, optional: Beta_0 parameter of TND.

  • tnd_beta1 – int, optional: Beta_1 (skew) parameter of TND.

  • tnd_noise_words_max – int, optional: Number of noise words to keep when saving the distribution to a file. The tnd_noise_words_max most probable noise words will be saved.

  • tnd_iterations – int, optional: Number of training iterations for TND.

  • lda_iterations – int, optional: Number of training iterations for LDA.

  • lda_k – int, optional: Number of topics to compute in LDA.

  • phi – int, optional: Topic weighting for noise filtering step.

  • topic_depth – int, optional: Number of most probable words per topic to consider for replacement in noise filtering step.

  • embedding_path – filepath, required: Path to trained word embedding vectors.

  • closest_x_words – int, optional: The number of words to sample from the word embedding space each time a word is determined to be a noise word.

  • top_words – int, optional: Number of words per topic to return.

  • tnd_noise_distribution – dict, optional: Pre-trained noise distribution.

  • lda_tw_dist – dict, optional: Pre-trained topic-word distribution.

  • lda_topics – list of lists, optional: Pre-computed LDA topics.

  • corpus – Gensim object, optional: Formatted documents for use in model. Automatically computed if not provided.

  • dictionary – Gensim object, optional: Formatted word mapping for use in model. Automatically computed if not provided.

  • mallet_tnd_path – filepath, required: Path to the Mallet TND executable, in the form path/to/mallet-tnd/bin/mallet.

  • mallet_lda_path – filepath, required: Path to the Mallet LDA executable, in the form path/to/mallet-lda/bin/mallet.

  • random_seed – int, optional: Seed for processes that use random number generation.

  • run – bool, optional: If True, run the model on initialization, provided that data is supplied.

  • tnd_workers – int, optional: Number of cores to use for computation of TND.

  • lda_workers – int, optional: Number of cores to use for computation of LDA.