Topic-Noise Models

gdtm.models.tnd module

class gdtm.models.tnd.TND(dataset=None, k=30, alpha=50, beta0=0.01, beta1=25, noise_words_max=200, iterations=1000, top_words=20, topic_word_distribution=None, noise_distribution=None, corpus=None, dictionary=None, mallet_path=None, random_seed=1824, run=True, workers=4)

Bases: object

Topic-Noise Discriminator (TND), the original topic-noise model. It is best used in an ensemble with other models, such as LDA (yielding NLDA) or the Guided Topic Model (GTM).

Parameters

dataset – list of lists, required.
k – int, optional: Number of topics to compute in TND.
alpha – int, optional: Alpha parameter of TND.
beta0 – float, optional: Beta_0 parameter of TND.
beta1 – int, optional: Beta_1 (skew) parameter of TND.
noise_words_max – int, optional: Number of noise words to save when saving the distribution to a file. The top noise_words_max most probable noise words will be saved.
iterations – int, optional: Number of training iterations for TND.
top_words – int, optional: Number of words per topic to return.
topic_word_distribution – dict, optional: Pre-trained topic-word distribution.
noise_distribution – dict, optional: Pre-trained noise distribution.
corpus – Gensim object, optional: Formatted documents for use in model. Automatically computed if not provided.
dictionary – Gensim object, optional: Formatted word mapping for use in model. Automatically computed if not provided.
mallet_path – path to Mallet TND code, required: Path should be path/to/mallet-tnd/bin/mallet.
random_seed – int, optional: Seed for processes that rely on random number generation.
run – bool, optional: If true, run the model on initialization, provided that data is supplied.
workers – int, optional: Number of cores to use for computation of TND.
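
A minimal sketch of how the constructor might be called, using the defaults from the signature above. The toy dataset, the helpers `tnd_kwargs` and `run_tnd`, and the placeholder Mallet path are illustrative assumptions, not part of gdtm. The import is deferred into `run_tnd` so that the rest of the block can be run without gdtm or Mallet installed.

```python
# Toy dataset in the documented "list of lists" shape: one token list per document.
dataset = [
    ["topic", "models", "find", "themes"],
    ["noise", "words", "hurt", "topic", "quality"],
]


def tnd_kwargs(dataset, mallet_path):
    """Hypothetical helper: collect constructor arguments with the documented defaults."""
    return {
        "dataset": dataset,
        "mallet_path": mallet_path,  # should end in mallet-tnd/bin/mallet
        "k": 30,
        "alpha": 50,
        "beta0": 0.01,
        "beta1": 25,
        "noise_words_max": 200,
        "iterations": 1000,
        "top_words": 20,
        "random_seed": 1824,
        "workers": 4,
    }


def run_tnd(dataset, mallet_path):
    """Hypothetical helper: requires gdtm and the Mallet TND code to be installed."""
    from gdtm.models.tnd import TND  # deferred so the sketch loads without gdtm

    return TND(**tnd_kwargs(dataset, mallet_path))


kwargs = tnd_kwargs(dataset, "path/to/mallet-tnd/bin/mallet")
```

Calling `run_tnd(dataset, "path/to/mallet-tnd/bin/mallet")` would then build the model, assuming the path points at a real Mallet TND installation.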

get_noise_distribution(tnd_noise_words_max=None)

Takes self.noise_distribution and tnd_noise_words_max and returns a list of (noise word, frequency) tuples ranked by frequency.

Parameters: tnd_noise_words_max – number of words to be returned
Returns: list of (noise word, frequency) tuples
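
The ranking described here can be sketched in a few lines of plain Python. This is an illustration of the documented behavior, not the library's code: `rank_noise_words` is a hypothetical name, the noise distribution is assumed to be a dict mapping words to frequencies, and treating `None` as "return everything" is also an assumption (the library may instead fall back to the value given at construction).

```python
def rank_noise_words(noise_distribution, noise_words_max=None):
    """Return (word, frequency) tuples sorted by descending frequency."""
    ranked = sorted(noise_distribution.items(), key=lambda item: item[1], reverse=True)
    # Assumption: None means "no cut-off" in this sketch.
    return ranked if noise_words_max is None else ranked[:noise_words_max]


toy_noise = {"the": 120, "lol": 45, "rt": 300, "um": 7}
top_two = rank_noise_words(toy_noise, noise_words_max=2)
```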

get_topics(top_words=None)

Takes top_words and self.topics and returns a list of topics, each a list of length top_words.

Parameters: top_words – number of words per topic
Returns: list of topic lists
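
As a rough illustration of the return shape, the sketch below truncates each topic to its first `top_words` entries. `truncate_topics` is a hypothetical helper, and it assumes each topic is already ordered from most to least probable word.

```python
def truncate_topics(topics, top_words):
    """Keep only the first top_words entries of every topic list."""
    return [topic[:top_words] for topic in topics]


toy_topics = [
    ["vote", "election", "ballot", "poll"],
    ["game", "team", "score", "coach"],
]
short_topics = truncate_topics(toy_topics, top_words=2)
```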

gdtm.models.nlda module

class gdtm.models.nlda.NLDA(dataset=None, tnd_k=30, tnd_alpha=50, tnd_beta0=0.01, tnd_beta1=25, tnd_noise_words_max=200, tnd_iterations=1000, lda_iterations=1000, lda_k=30, phi=10, topic_depth=100, top_words=20, tnd_noise_distribution=None, lda_tw_dist=None, lda_topics=None, corpus=None, dictionary=None, mallet_tnd_path=None, mallet_lda_path=None, random_seed=1824, run=True, tnd_workers=4, lda_workers=4)

Bases: object

Noiseless Latent Dirichlet Allocation (NLDA). An ensemble topic-noise model consisting of the noise distribution from TND and the topic-word distribution from LDA. Input the raw data and compute the whole model, or input pre-computed distributions for faster inference.

Parameters

dataset – list of lists, required.
tnd_k – int, optional: Number of topics to compute in TND.
tnd_alpha – int, optional: Alpha parameter of TND.
tnd_beta0 – float, optional: Beta_0 parameter of TND.
tnd_beta1 – int, optional: Beta_1 (skew) parameter of TND.
tnd_noise_words_max – int, optional: Number of noise words to save when saving the distribution to a file. The top tnd_noise_words_max most probable noise words will be saved.
tnd_iterations – int, optional: Number of training iterations for TND.
lda_iterations – int, optional: Number of training iterations for LDA.
lda_k – int, optional: Number of topics to compute in LDA.
phi – int, optional: Topic weighting for noise filtering step.
topic_depth – int, optional: Number of most probable words per topic to consider for replacement in noise filtering step.
top_words – int, optional: Number of words per topic to return.
tnd_noise_distribution – dict, optional: Pre-trained noise distribution.
lda_tw_dist – dict, optional: Pre-trained topic-word distribution.
lda_topics – list of lists, optional: Pre-computed LDA topics.
corpus – Gensim object, optional: Formatted documents for use in model. Automatically computed if not provided.
dictionary – Gensim object, optional: Formatted word mapping for use in model. Automatically computed if not provided.
mallet_tnd_path – path to Mallet TND code, required: Path should be path/to/mallet-tnd/bin/mallet.
mallet_lda_path – path to Mallet LDA code, required: Path should be path/to/mallet-lda/bin/mallet.
random_seed – int, optional: Seed for processes that rely on random number generation.
run – bool, optional: If true, run the model on initialization when data is provided.
tnd_workers – int, optional: Number of cores to use for computation of TND.
lda_workers – int, optional: Number of cores to use for computation of LDA.
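
To make the roles of phi, topic_depth, and top_words more concrete, here is a simplified, deterministic sketch of a noise filtering step. It is an assumption-laden illustration and not the library's algorithm: the function name `filter_topic`, the `phi / k` scaling, and the plain "weighted topic frequency versus noise frequency" comparison are all stand-ins, and the actual implementation may differ (for example, by being probabilistic).

```python
def filter_topic(topic_words, topic_freq, noise_freq,
                 phi=10, k=30, topic_depth=100, top_words=20):
    """Keep words whose phi-weighted topic frequency is at least their noise frequency.

    Only the first topic_depth words are considered, and at most top_words are kept.
    """
    kept = []
    for word in topic_words[:topic_depth]:
        if len(kept) == top_words:
            break
        weighted = topic_freq.get(word, 0) * (phi / k)  # assumed weighting
        if weighted >= noise_freq.get(word, 0):
            kept.append(word)
    return kept


# Toy inputs: words ordered by topic probability, plus made-up frequencies.
topic_words = ["election", "rt", "vote", "lol", "ballot"]
topic_freq = {"election": 90, "rt": 60, "vote": 75, "lol": 12, "ballot": 30}
noise_freq = {"rt": 300, "lol": 45, "vote": 5}

filtered = filter_topic(topic_words, topic_freq, noise_freq, top_words=3)
```

In this toy run, "rt" and "lol" are dropped because their noise frequency outweighs their weighted topic frequency, and words further down the topic take their place.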

get_noise_distribution(tnd_noise_words_max=None)

Takes self.tnd_noise_distribution and tnd_noise_words_max and returns a list of (noise word, frequency) tuples ranked by frequency.

Parameters: tnd_noise_words_max – number of words to be returned
Returns: list of (noise word, frequency) tuples

get_topics(top_words=None)

Takes top_words and self.topics and returns a list of topics, each a list of length top_words.

Parameters: top_words – number of words per topic
Returns: list of topic lists
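
The two ways of using NLDA mentioned in the class description (computing everything from raw data, or passing in pre-computed distributions) might look like the sketch below. `nlda_kwargs` and `run_nlda` are hypothetical helpers, the Mallet paths are placeholders, and the toy noise distribution is invented for illustration; the format the library expects for pre-computed distributions is not shown here and should be checked against the code.

```python
def nlda_kwargs(dataset, tnd_path, lda_path, noise=None, tw_dist=None, topics=None):
    """Hypothetical helper: build NLDA arguments, adding pre-computed parts only if given."""
    kwargs = {
        "dataset": dataset,
        "mallet_tnd_path": tnd_path,  # should end in mallet-tnd/bin/mallet
        "mallet_lda_path": lda_path,  # should end in mallet-lda/bin/mallet
        "tnd_k": 30,
        "lda_k": 30,
        "phi": 10,
        "top_words": 20,
    }
    optional = {
        "tnd_noise_distribution": noise,
        "lda_tw_dist": tw_dist,
        "lda_topics": topics,
    }
    kwargs.update({name: value for name, value in optional.items() if value is not None})
    return kwargs


def run_nlda(**kwargs):
    """Hypothetical helper: requires gdtm plus both Mallet installations."""
    from gdtm.models.nlda import NLDA  # deferred so the sketch loads without gdtm

    model = NLDA(**kwargs)
    return model.get_topics(), model.get_noise_distribution()


dataset = [["topic", "models"], ["noise", "words"]]
tnd_path = "path/to/mallet-tnd/bin/mallet"
lda_path = "path/to/mallet-lda/bin/mallet"

from_scratch = nlda_kwargs(dataset, tnd_path, lda_path)
with_noise = nlda_kwargs(dataset, tnd_path, lda_path, noise={"rt": 300})
```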

Embedded Topic-Noise Models

gdtm.models.etnd module

class gdtm.models.etnd.eTND(dataset=None, k=30, alpha=50, beta0=0.01, beta1=25, noise_words_max=200, iterations=1000, top_words=20, topic_word_distribution=None, noise_distribution=None, corpus=None, dictionary=None, mallet_path=None, embedding_path=None, closest_x_words=3, random_seed=1824, run=True, workers=4)

Bases: gdtm.models.tnd.TND

Embedded Topic-Noise Discriminator (eTND), the embedded version of TND. It is best used in an ensemble with other models, such as LDA (yielding NLDA) or the Guided Topic Model (GTM).

Parameters

dataset – list of lists, required.
k – int, optional: Number of topics to compute in TND.
alpha – int, optional: Alpha parameter of TND.
beta0 – float, optional: Beta_0 parameter of TND.
beta1 – int, optional: Beta_1 (skew) parameter of TND.
noise_words_max – int, optional: Number of noise words to save when saving the distribution to a file. The top noise_words_max most probable noise words will be saved.
embedding_path – filepath, required: Path to trained word embedding vectors.
closest_x_words – int, optional: The number of words to sample from the word embedding space each time a word is determined to be a noise word.
iterations – int, optional: Number of training iterations for TND.
top_words – int, optional: Number of words per topic to return.
topic_word_distribution – dict, optional: Pre-trained topic-word distribution.
noise_distribution – dict, optional: Pre-trained noise distribution.
corpus – Gensim object, optional: Formatted documents for use in model. Automatically computed if not provided.
dictionary – Gensim object, optional: Formatted word mapping for use in model. Automatically computed if not provided.
mallet_path – path to Mallet TND code, required: Path should be path/to/mallet-tnd/bin/mallet.
random_seed – int, optional: Seed for processes that rely on random number generation.
run – bool, optional: If true, run the model on initialization, provided that data is supplied.
workers – int, optional: Number of cores to use for computation of TND.
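
The closest_x_words parameter refers to picking nearby words in the embedding space. The sketch below shows one common way to do that, ranking neighbors by cosine similarity over a toy embedding table. It is only an illustration: `cosine`, `closest_words`, and the two-dimensional vectors are assumptions, and the similarity measure and sampling procedure that eTND actually uses are not specified in this reference.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_words(word, embeddings, closest_x_words=3):
    """Return the closest_x_words most similar words to `word`, excluding itself."""
    target = embeddings[word]
    scored = [
        (other, cosine(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:closest_x_words]]


# Invented two-dimensional vectors, purely for illustration.
toy_embeddings = {
    "lol": [1.0, 0.0],
    "haha": [0.9, 0.1],
    "lmao": [0.8, 0.3],
    "election": [0.0, 1.0],
}
neighbours = closest_words("lol", toy_embeddings, closest_x_words=2)
```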

gdtm.models.enlda module

class gdtm.models.enlda.eNLDA(dataset=None, tnd_k=30, tnd_alpha=50, tnd_beta0=0.01, tnd_beta1=25, tnd_noise_words_max=200, tnd_iterations=1000, lda_iterations=1000, lda_k=30, phi=10, topic_depth=100, top_words=20, tnd_noise_distribution=None, lda_tw_dist=None, lda_topics=None, corpus=None, dictionary=None, mallet_tnd_path=None, mallet_lda_path=None, embedding_path=None, closest_x_words=3, random_seed=1824, run=True, tnd_workers=4, lda_workers=4)

Bases: gdtm.models.nlda.NLDA

Embedded Noiseless Latent Dirichlet Allocation (eNLDA). An ensemble topic-noise model consisting of the noise distribution from TND and the topic-word distribution from LDA. Input the raw data and compute the whole model, or input pre-computed distributions for faster inference. Uses word embedding vectors to enhance the TND noise distribution.

Parameters

dataset – list of lists, required.
tnd_k – int, optional: Number of topics to compute in TND.
tnd_alpha – int, optional: Alpha parameter of TND.
tnd_beta0 – float, optional: Beta_0 parameter of TND.
tnd_beta1 – int, optional: Beta_1 (skew) parameter of TND.
tnd_noise_words_max – int, optional: Number of noise words to save when saving the distribution to a file. The top tnd_noise_words_max most probable noise words will be saved.
tnd_iterations – int, optional: Number of training iterations for TND.
lda_iterations – int, optional: Number of training iterations for LDA.
lda_k – int, optional: Number of topics to compute in LDA.
phi – int, optional: Topic weighting for noise filtering step.
topic_depth – int, optional: Number of most probable words per topic to consider for replacement in noise filtering step.
embedding_path – filepath, required: Path to trained word embedding vectors.
closest_x_words – int, optional: The number of words to sample from the word embedding space each time a word is determined to be a noise word.
top_words – int, optional: Number of words per topic to return.
tnd_noise_distribution – dict, optional: Pre-trained noise distribution.
lda_tw_dist – dict, optional: Pre-trained topic-word distribution.
lda_topics – list of lists, optional: Pre-computed LDA topics.
corpus – Gensim object, optional: Formatted documents for use in model. Automatically computed if not provided.
dictionary – Gensim object, optional: Formatted word mapping for use in model. Automatically computed if not provided.
mallet_tnd_path – path to Mallet TND code, required: Path should be path/to/mallet-tnd/bin/mallet.
mallet_lda_path – path to Mallet LDA code, required: Path should be path/to/mallet-lda/bin/mallet.
random_seed – int, optional: Seed for processes that rely on random number generation.
run – bool, optional: If true, run the model on initialization when data is provided.
tnd_workers – int, optional: Number of cores to use for computation of TND.
lda_workers – int, optional: Number of cores to use for computation of LDA.