Topic-Noise Models
gdtm.models.tnd module
- class gdtm.models.tnd.TND(dataset=None, k=30, alpha=50, beta0=0.01, beta1=25, noise_words_max=200, iterations=1000, top_words=20, topic_word_distribution=None, noise_distribution=None, corpus=None, dictionary=None, mallet_path=None, random_seed=1824, run=True, workers=4)
Bases: object
Topic-Noise Discriminator (TND). TND is the original topic-noise model. It is best used in an ensemble with another model, such as LDA (forming NLDA) or the Guided Topic Model (GTM).
- Parameters
dataset – list of lists, required.
k – int, optional: Number of topics to compute in TND.
alpha – int, optional: Alpha parameter of TND.
beta0 – float, optional: Beta_0 parameter of TND.
beta1 – int, optional: Beta_1 (skew) parameter of TND.
noise_words_max – int, optional: Number of noise words to save when saving the distribution to a file. The top noise_words_max most probable noise words will be saved.
iterations – int, optional: Number of training iterations for TND.
top_words – int, optional: Number of words per topic to return.
topic_word_distribution – dict, optional: Pre-trained topic-word distribution.
noise_distribution – dict, optional: Pre-trained noise distribution.
corpus – Gensim object, optional: Formatted documents for use in model. Automatically computed if not provided.
dictionary – Gensim object, optional: Formatted word mapping for use in model. Automatically computed if not provided.
mallet_path – path to Mallet TND code, required: Path should be path/to/mallet-tnd/bin/mallet.
random_seed – int, optional: Seed for random-number generated processes.
run – bool, optional: If True, run the model on initialization, provided that data is supplied.
workers – int, optional: Number of cores to use for computation of TND.
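A minimal construction sketch, assuming each inner list of dataset holds the tokens of one document (the reference above only says "list of lists"). The helpers tokenize and build_tnd are hypothetical and not part of gdtm, and the Mallet path is the placeholder from the parameter description. Only the keyword arguments are taken from the signature above. The import is deferred so the helper can be defined without gdtm installed.

```python
def tokenize(docs):
    """Split raw strings into a list of token lists (assumed dataset shape)."""
    return [doc.lower().split() for doc in docs]


def build_tnd(dataset, mallet_path, k=30, run=False):
    """Hypothetical helper: construct a TND model from the documented keywords."""
    # Deferred import: defining this function does not require gdtm.
    from gdtm.models.tnd import TND
    return TND(dataset=dataset, k=k, mallet_path=mallet_path,
               random_seed=1824, run=run)


dataset = tokenize(["Vote early vote often", "lol just saw the debate"])
print(dataset)

# With gdtm and the Mallet TND build available, one might then call:
# model = build_tnd(dataset, "path/to/mallet-tnd/bin/mallet")
```

Passing run=False here is meant to defer training, in line with the description of the run parameter.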
- get_noise_distribution(tnd_noise_words_max=None)
Takes self.noise_distribution and tnd_noise_words_max and returns a list of (noise word, frequency) tuples ranked by frequency.
- Parameters
tnd_noise_words_max – number of words to be returned
- Returns
list of (noise word, frequency) tuples
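The ranking described here can be sketched in plain Python. This is an illustration of the documented behavior, not the library's code: it assumes the noise distribution is a dict mapping each word to its frequency, and the function name rank_noise and the toy counts are invented for the example.

```python
def rank_noise(noise_distribution, noise_words_max):
    """Return the noise_words_max most frequent (word, frequency) tuples."""
    # Sort by frequency, highest first, then keep the requested number.
    ranked = sorted(noise_distribution.items(),
                    key=lambda item: item[1], reverse=True)
    return ranked[:noise_words_max]


# Toy word -> frequency mapping (assumed format).
toy_noise = {"lol": 42, "rt": 57, "the": 120, "um": 3}
top_noise = rank_noise(toy_noise, 2)
print(top_noise)  # [('the', 120), ('rt', 57)]
```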
- get_topics(top_words=None)
Takes top_words and self.topics and returns a list of topic lists, each of length top_words.
- Parameters
top_words – number of words per topic
- Returns
list of topic lists
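As a rough picture of the return value, assuming each topic is stored as a list of words ordered from most to least probable (the reference does not state the ordering), truncation to top_words could look like the following. The name truncate_topics and the sample topics are made up for the example.

```python
def truncate_topics(topics, top_words):
    """Keep only the first top_words entries of every topic list."""
    return [topic[:top_words] for topic in topics]


toy_topics = [
    ["vote", "ballot", "election", "poll"],
    ["game", "score", "team", "season"],
]
short_topics = truncate_topics(toy_topics, 2)
print(short_topics)  # [['vote', 'ballot'], ['game', 'score']]
```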
gdtm.models.nlda module
- class gdtm.models.nlda.NLDA(dataset=None, tnd_k=30, tnd_alpha=50, tnd_beta0=0.01, tnd_beta1=25, tnd_noise_words_max=200, tnd_iterations=1000, lda_iterations=1000, lda_k=30, phi=10, topic_depth=100, top_words=20, tnd_noise_distribution=None, lda_tw_dist=None, lda_topics=None, corpus=None, dictionary=None, mallet_tnd_path=None, mallet_lda_path=None, random_seed=1824, run=True, tnd_workers=4, lda_workers=4)
Bases: object
Noiseless Latent Dirichlet Allocation (NLDA). An ensemble topic-noise model consisting of the noise distribution from TND and the topic-word distribution from LDA. Input the raw data and compute the whole model, or input pre-computed distributions for faster inference.
- Parameters
dataset – list of lists, required.
tnd_k – int, optional: Number of topics to compute in TND.
tnd_alpha – int, optional: Alpha parameter of TND.
tnd_beta0 – float, optional: Beta_0 parameter of TND.
tnd_beta1 – int, optional: Beta_1 (skew) parameter of TND.
tnd_noise_words_max – int, optional: Number of noise words to save when saving the distribution to a file. The top tnd_noise_words_max most probable noise words will be saved.
tnd_iterations – int, optional: Number of training iterations for TND.
lda_iterations – int, optional: Number of training iterations for LDA.
lda_k – int, optional: Number of topics to compute in LDA.
phi – int, optional: Topic weighting for noise filtering step.
topic_depth – int, optional: Number of most probable words per topic to consider for replacement in noise filtering step.
top_words – int, optional: Number of words per topic to return.
tnd_noise_distribution – dict, optional: Pre-trained noise distribution.
lda_tw_dist – dict, optional: Pre-trained topic-word distribution.
lda_topics – list of lists, optional: Pre-computed LDA topics.
corpus – Gensim object, optional: Formatted documents for use in model. Automatically computed if not provided.
dictionary – Gensim object, optional: Formatted word mapping for use in model. Automatically computed if not provided.
mallet_tnd_path – path to Mallet TND code, required: Path should be path/to/mallet-tnd/bin/mallet.
mallet_lda_path – path to Mallet LDA code, required: Path should be path/to/mallet-lda/bin/mallet.
random_seed – int, optional: Seed for random-number generated processes.
run – bool, optional: If True, run the model on initialization if data is provided.
tnd_workers – int, optional: Number of cores to use for computation of TND.
lda_workers – int, optional: Number of cores to use for computation of LDA.
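To make the roles of phi, topic_depth, and top_words concrete, here is a simplified, deterministic stand-in for a noise filtering step. The decision rule is an assumption made purely for illustration: a word is kept when its count in the topic exceeds its noise count scaled by phi / k. This reference does not specify the rule NLDA actually applies, and it may differ (for example, it may be probabilistic). The function name filter_topics and all data are invented.

```python
def filter_topics(lda_topics, topic_word_counts, noise_counts,
                  phi=10, topic_depth=100, top_words=20):
    """Drop likely noise words from each topic (illustrative rule only)."""
    k = len(lda_topics)
    scale = phi / k  # assumed way of weighting noise against topics
    filtered = []
    for topic_id, words in enumerate(lda_topics):
        kept = []
        # Only the topic_depth most probable words are candidates.
        for word in words[:topic_depth]:
            if len(kept) == top_words:
                break
            topic_weight = topic_word_counts[topic_id].get(word, 0)
            noise_weight = noise_counts.get(word, 0) * scale
            if topic_weight > noise_weight:
                kept.append(word)
        filtered.append(kept)
    return filtered


lda_topics = [["the", "vote", "lol", "ballot"],
              ["rt", "game", "score", "the"]]
topic_word_counts = [
    {"the": 50, "vote": 40, "lol": 5, "ballot": 30},
    {"rt": 10, "game": 60, "score": 45, "the": 20},
]
noise_counts = {"the": 120, "rt": 57, "lol": 42}

clean_topics = filter_topics(lda_topics, topic_word_counts, noise_counts,
                             phi=2, topic_depth=4, top_words=2)
print(clean_topics)  # [['vote', 'ballot'], ['game', 'score']]
```

In this sketch a larger topic_depth gives the filter more candidates from which to refill a topic after noise words are dropped, which matches the parameter's description as the number of words considered for replacement.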
- get_noise_distribution(tnd_noise_words_max=None)
Takes self.tnd_noise_distribution and tnd_noise_words_max and returns a list of (noise word, frequency) tuples ranked by frequency.
- Parameters
tnd_noise_words_max – number of words to be returned
- Returns
list of (noise word, frequency) tuples
- get_topics(top_words=None)
Takes top_words and self.topics and returns a list of topic lists, each of length top_words.
- Parameters
top_words – number of words per topic
- Returns
list of topic lists
Embedded Topic-Noise Models
gdtm.models.etnd module
- class gdtm.models.etnd.eTND(dataset=None, k=30, alpha=50, beta0=0.01, beta1=25, noise_words_max=200, iterations=1000, top_words=20, topic_word_distribution=None, noise_distribution=None, corpus=None, dictionary=None, mallet_path=None, embedding_path=None, closest_x_words=3, random_seed=1824, run=True, workers=4)
Bases: gdtm.models.tnd.TND
Embedded Topic-Noise Discriminator (eTND). eTND is the embedded version of TND. It is best used in an ensemble with another model, such as LDA (forming NLDA) or the Guided Topic Model (GTM).
- Parameters
dataset – list of lists, required.
k – int, optional: Number of topics to compute in TND.
alpha – int, optional: Alpha parameter of TND.
beta0 – float, optional: Beta_0 parameter of TND.
beta1 – int, optional: Beta_1 (skew) parameter of TND.
noise_words_max – int, optional: Number of noise words to save when saving the distribution to a file. The top noise_words_max most probable noise words will be saved.
embedding_path – filepath, required: Path to trained word embedding vectors.
closest_x_words – int, optional: The number of words to sample from the word embedding space each time a word is determined to be a noise word.
iterations – int, optional: Number of training iterations for TND.
top_words – int, optional: Number of words per topic to return.
topic_word_distribution – dict, optional: Pre-trained topic-word distribution.
noise_distribution – dict, optional: Pre-trained noise distribution.
corpus – Gensim object, optional: Formatted documents for use in model. Automatically computed if not provided.
dictionary – Gensim object, optional: Formatted word mapping for use in model. Automatically computed if not provided.
mallet_path – path to Mallet TND code, required: Path should be path/to/mallet-tnd/bin/mallet.
random_seed – int, optional: Seed for random-number generated processes.
run – bool, optional: If True, run the model on initialization, provided that data is supplied.
workers – int, optional: Number of cores to use for computation of TND.
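The closest_x_words parameter can be pictured as a nearest-neighbor lookup in the embedding space. The sketch below assumes cosine similarity and a plain dict of word vectors. The reference says words are sampled from the embedding space but does not name a similarity measure or a storage format, so both are assumptions, and the helper names and toy vectors are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_words(word, embeddings, closest_x_words=3):
    """Return the closest_x_words nearest neighbors of word (assumed rule)."""
    target = embeddings[word]
    scored = [(other, cosine(target, vec))
              for other, vec in embeddings.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:closest_x_words]]


# Toy three-dimensional embeddings (hypothetical values).
toy_embeddings = {
    "lol": [0.9, 0.1, 0.0],
    "lmao": [0.8, 0.2, 0.0],
    "haha": [0.7, 0.3, 0.1],
    "ballot": [0.0, 0.1, 0.9],
}
neighbors = closest_words("lol", toy_embeddings, closest_x_words=2)
print(neighbors)  # ['lmao', 'haha']
```

Under this reading, each time "lol" is judged to be noise, its nearest neighbors would also be credited as noise, which is how embeddings could extend the noise distribution to rarer variants of a noise word.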
gdtm.models.enlda module
- class gdtm.models.enlda.eNLDA(dataset=None, tnd_k=30, tnd_alpha=50, tnd_beta0=0.01, tnd_beta1=25, tnd_noise_words_max=200, tnd_iterations=1000, lda_iterations=1000, lda_k=30, phi=10, topic_depth=100, top_words=20, tnd_noise_distribution=None, lda_tw_dist=None, lda_topics=None, corpus=None, dictionary=None, mallet_tnd_path=None, mallet_lda_path=None, embedding_path=None, closest_x_words=3, random_seed=1824, run=True, tnd_workers=4, lda_workers=4)
Bases: gdtm.models.nlda.NLDA
Embedded Noiseless Latent Dirichlet Allocation (eNLDA). An ensemble topic-noise model consisting of the noise distribution from TND and the topic-word distribution from LDA. Input the raw data and compute the whole model, or input pre-computed distributions for faster inference. Uses word embedding vectors to enhance the TND noise distribution.
- Parameters
dataset – list of lists, required.
tnd_k – int, optional: Number of topics to compute in TND.
tnd_alpha – int, optional: Alpha parameter of TND.
tnd_beta0 – float, optional: Beta_0 parameter of TND.
tnd_beta1 – int, optional: Beta_1 (skew) parameter of TND.
tnd_noise_words_max – int, optional: Number of noise words to save when saving the distribution to a file. The top tnd_noise_words_max most probable noise words will be saved.
tnd_iterations – int, optional: Number of training iterations for TND.
lda_iterations – int, optional: Number of training iterations for LDA.
lda_k – int, optional: Number of topics to compute in LDA.
phi – int, optional: Topic weighting for noise filtering step.
topic_depth – int, optional: Number of most probable words per topic to consider for replacement in noise filtering step.
embedding_path – filepath, required: Path to trained word embedding vectors.
closest_x_words – int, optional: The number of words to sample from the word embedding space each time a word is determined to be a noise word.
top_words – int, optional: Number of words per topic to return.
tnd_noise_distribution – dict, optional: Pre-trained noise distribution.
lda_tw_dist – dict, optional: Pre-trained topic-word distribution.
lda_topics – list of lists, optional: Pre-computed LDA topics.
corpus – Gensim object, optional: Formatted documents for use in model. Automatically computed if not provided.
dictionary – Gensim object, optional: Formatted word mapping for use in model. Automatically computed if not provided.
mallet_tnd_path – path to Mallet TND code, required: Path should be path/to/mallet-tnd/bin/mallet.
mallet_lda_path – path to Mallet LDA code, required: Path should be path/to/mallet-lda/bin/mallet.
random_seed – int, optional: Seed for random-number generated processes.
run – bool, optional: If True, run the model on initialization if data is provided.
tnd_workers – int, optional: Number of cores to use for computation of TND.
lda_workers – int, optional: Number of cores to use for computation of LDA.