TY - JOUR
T1 - Supervised topic modeling using hierarchical Dirichlet process-based inverse regression
T2 - Experiments on e-commerce applications
AU - Li, Weifeng
AU - Yin, Junming
AU - Chen, Hsinchun
N1 - Funding Information:
This work was supported by the US National Science Foundation under Grant No. SES-1314631 and also under Grant No. DUE-1303362.
Publisher Copyright:
© 1989-2012 IEEE.
PY - 2018/6/1
Y1 - 2018/6/1
N2 - The proliferation of e-commerce calls for mining consumer preferences and opinions from user-generated text. To this end, topic models have been widely adopted to discover the underlying semantic themes (i.e., topics). Supervised topic models have emerged to leverage discovered topics for predicting the response of interest (e.g., product quality and sales). However, supervised topic modeling remains a challenging problem because of the need to prespecify the number of topics, the lack of predictive information in topics, and limited scalability. In this paper, we propose a novel supervised topic model, Hierarchical Dirichlet Process-based Inverse Regression (HDP-IR). HDP-IR characterizes the corpus with a flexible number of topics, which prove to retain as much predictive information as the original corpus. Moreover, we develop an efficient inference algorithm capable of examining large-scale corpora (millions of documents or more). Three experiments were conducted to evaluate the predictive performance over major e-commerce benchmark testbeds of online reviews. Overall, HDP-IR outperformed existing state-of-the-art supervised topic models. Particularly, retaining sufficient predictive information improved predictive R-squared by over 17.6 percent; having topic structure flexibility contributed to predictive R-squared by at least 4.1 percent. HDP-IR provides an important step for future study on user-generated texts from a topic perspective.
AB - The proliferation of e-commerce calls for mining consumer preferences and opinions from user-generated text. To this end, topic models have been widely adopted to discover the underlying semantic themes (i.e., topics). Supervised topic models have emerged to leverage discovered topics for predicting the response of interest (e.g., product quality and sales). However, supervised topic modeling remains a challenging problem because of the need to prespecify the number of topics, the lack of predictive information in topics, and limited scalability. In this paper, we propose a novel supervised topic model, Hierarchical Dirichlet Process-based Inverse Regression (HDP-IR). HDP-IR characterizes the corpus with a flexible number of topics, which prove to retain as much predictive information as the original corpus. Moreover, we develop an efficient inference algorithm capable of examining large-scale corpora (millions of documents or more). Three experiments were conducted to evaluate the predictive performance over major e-commerce benchmark testbeds of online reviews. Overall, HDP-IR outperformed existing state-of-the-art supervised topic models. Particularly, retaining sufficient predictive information improved predictive R-squared by over 17.6 percent; having topic structure flexibility contributed to predictive R-squared by at least 4.1 percent. HDP-IR provides an important step for future study on user-generated texts from a topic perspective.
KW - Bayesian nonparametrics
KW - Hierarchical Dirichlet process
KW - Sufficient dimension reduction
KW - Topic modeling
KW - Variational inference
UR - http://www.scopus.com/inward/record.url?scp=85039797753&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85039797753&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2017.2786727
DO - 10.1109/TKDE.2017.2786727
M3 - Article
AN - SCOPUS:85039797753
SN - 1041-4347
VL - 30
SP - 1192
EP - 1205
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 6
ER -