Abstract
We propose a novel, accurate, and explainable
recommender model (BENEFICT) that
addresses two drawbacks most review-based
recommender systems face. The first is their
reliance on traditional word embeddings, which
can hurt prediction performance because such
embeddings cannot model the dynamic nature
of word semantics. The second is their black-box
nature, which makes the explanation behind
every prediction obscure.
Our model uniquely integrates three key ele-
ments: BERT, multilayer perceptron, and max-
imum subarray problem to derive contextual-
ized review features, model user-item interac-
tions, and generate explanations, respectively.
Our experiments show that BENEFICT
consistently outperforms other state-of-the-art
models by an average improvement of nearly
7%. Based on human judges' assessment,
the BENEFICT-produced explanations
can capture the essence of the customer’s pref-
erence and help future customers make pur-
chasing decisions. To the best of our knowl-
edge, our model is one of the first recom-
mender models to utilize BERT for neural col-
laborative filtering.
1 Introduction
In recommender systems research, collaborative
filtering (CF) is the dominant state-of-the-art rec-
ommendation model, which primarily focuses on
learning accurate representations of users (user
preferences) and items (item characteristics) (Chen
et al., 2018; Tay et al., 2018). The earliest rec-
ommender models learned these representations
based on user-given numeric ratings that each item
received (Mnih and Salakhutdinov, 2008; Koren
et al., 2009). However, ratings, which are values
on a single discrete scale, oversimplify user prefer-
ences and item characteristics (Musto et al., 2017).
The large number of users and items in a typical
online platform consequently results in a highly
sparse rating matrix, making it hard to learn accu-
rate representations (Zheng et al., 2017).
To alleviate these issues, review texts have in-
stead been utilized to model such representations
for subsequent recommendation and rating predic-
tion, and this approach has attracted growing at-
tention in research (Catherine and Cohen, 2017;
Zheng et al., 2017). The main advantage of reviews
as a source of features is that they capture the
multi-faceted substance of user opinions. Because
users explain the reasons underlying their ratings,
reviews contain rich, valuable latent information
that cannot otherwise be obtained from ratings alone
(Chen et al., 2018; Wang et al., 2019). Recently,
models that incorporate user reviews have yielded
state-of-the-art performances (Zheng et al., 2017;
Chen et al., 2018). These approaches learn user
and item representations by using traditional word
embeddings (e.g., word2vec, GloVe) to map each
word in the review into its corresponding vector.
The review is transformed into an embedded matrix
before being fed to a convolutional neural network
(CNN) (Chen et al., 2018). CNNs have been shown
to effectively model reviews and have demonstrated
outstanding results in numerous natural language
processing tasks (Wang et al., 2018a).
Nevertheless, most review-based recommender
models share several drawbacks. The first is the
use of traditional or mainstream word embeddings
to learn review features. Their static nature is a
hindrance: each word is assigned the same embedding
regardless of context. In other words, such
embeddings cannot capture the dynamic nature of
each word's semantics. For review-based recommenders, this could be
an issue in modeling users and items, which could,
in turn, affect recommendation performance (Pilehvar
and Camacho-Collados, 2019). Also, once a
CNN is fed the matrix of word embeddings, the
word frequency information of contextual features,
said to be crucial for modeling reviews, is lost
(Wang et al., 2018a).
Another drawback is the inherent black-box nature
of deep learning-based models, which makes the
explanations behind every prediction obscure
(Ribeiro et al., 2016; Wang et al., 2018b). The complex
architecture of hidden layers conceals the
models' internal decision-making processes (Peake
and Wang, 2018). Providing explanations could
help persuade users to make decisions and develop
trust in a recommender system (Zhang et al., 2014;
Ribeiro et al., 2016; Costa et al., 2018; Peake and
Wang, 2018). However, this leads us to a dilemma,
i.e., a trade-off between accuracy and explainability.
Usually, the most accurate models are inherently
complicated, non-transparent, and unexplainable
(Zhang and Chen, 2018). Conversely, explainable,
straightforward methods tend to
sacrifice accuracy. Formulating models that are
both explainable and accurate is a challenging yet
critical research agenda for the machine learning
community to ensure that we derive benefits from
machine learning fairly and responsibly (Peake and
Wang, 2018).
In this paper, we propose a unique model:
BERT-Based Neural Collaborative Filtering and
Fixed-Length Contiguous Tokens Explanation
(BENEFICT). Our model learns user and item representations simultaneously using two parallel networks. To address the first drawback, we incorporate BERT as a key component in each parallel network. BERT allows us to extract more meaningful,
contextualized features adaptable to arbitrary contexts; such features cannot be derived from mainstream word embeddings (Pilehvar and Camacho-Collados, 2019; Akbik et al., 2019). BERT
also retains the word frequency information,
making a CNN an unnecessary component of our
model. Once user and item representations are
learned, they are concatenated in a shared
hidden space before being fed to an optimal
stack of multilayer perceptron (MLP) layers that
serves as BENEFICT's interaction function.
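The interaction step can be sketched as a simple forward pass. This is an illustrative sketch, not the paper's implementation: the 768-dimensional inputs (BERT-base's hidden size), the layer sizes, and the random placeholder weights are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_interaction(user_vec, item_vec, layer_sizes):
    """Illustrative interaction function: concatenate the user and
    item representations, then pass them through a stack of
    ReLU-activated MLP layers; a final linear unit yields the
    predicted rating. Weights are random placeholders here."""
    x = np.concatenate([user_vec, item_vec])
    for out_dim in layer_sizes:
        w = rng.normal(scale=0.1, size=(out_dim, x.size))
        b = np.zeros(out_dim)
        x = np.maximum(0.0, w @ x + b)   # ReLU hidden layer
    w_out = rng.normal(scale=0.1, size=x.size)
    return float(w_out @ x)              # scalar rating prediction

# 768-dim vectors mimic BERT-base's hidden size (illustrative).
user = rng.normal(size=768)
item = rng.normal(size=768)
rating = mlp_interaction(user, item, layer_sizes=[256, 64, 8])
print(rating)
```

In a trained model the weights would of course be learned jointly with the rest of the network; the point here is only the shape of the computation, i.e., concatenation followed by a tower of non-linear layers.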
To address the second drawback, we introduce
a novel component that integrates
BERT's self-attention with an implementation of the
fixed-length maximum subarray problem (MSP),
a classic computer science problem. BERT applies self-attention in
each encoder layer, producing self-attention weights for each token; these are passed
to the successive encoder layers through feed-forward networks. We argue that these self-attention
weights can serve as the basis for explaining rating predictions. Based on this premise, MSP selects
a segment, or subarray, of consecutive tokens
with the maximum possible sum of self-attention
weights.
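The fixed-length maximum subarray over n token weights can be solved in O(n) with a sliding window. The sketch below is illustrative rather than the paper's code; the token list and per-token attention weights are invented for the example.

```python
def max_attention_span(weights, k):
    """Fixed-length maximum subarray via a sliding window:
    return (start, sum) for the length-k window of token
    attention weights with the largest sum, in O(n) time."""
    if k > len(weights):
        raise ValueError("k exceeds sequence length")
    window = sum(weights[:k])           # sum of the first window
    best_sum, best_start = window, 0
    for i in range(k, len(weights)):
        window += weights[i] - weights[i - k]   # slide right by one
        if window > best_sum:
            best_sum, best_start = window, i - k + 1
    return best_start, best_sum

# Hypothetical per-token self-attention weights for a short review.
tokens  = ["the", "battery", "life", "is", "amazing", "overall"]
weights = [0.02, 0.30, 0.25, 0.10, 0.28, 0.05]
start, total = max_attention_span(weights, 3)
print(tokens[start:start + 3])   # → ['battery', 'life', 'is']
```

The fixed length k would correspond to the desired explanation length; the returned token span is the candidate explanation.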
1.1 Contributions
Our work aims to fill this research gap by implementing a solution that is both accurate and explainable. We propose a novel model that uniquely
integrates three vital elements, i.e., BERT, MLP,
and MSP, to derive review features, model user-item interactions, and produce possible explanations. To the best of our knowledge, BENEFICT
is one of the first review-based recommender models to utilize BERT for neural CF. It is also, to the
best of our knowledge, one of
the first models to repurpose a portion of the Neural Collaborative Filtering (NCF) framework (He
et al., 2017) as the user-item interaction function
for review-based, explicit CF. Moreover, our experiments demonstrate that our model achieves
better rating prediction results than other state-of-the-art recommender models.
2 Related Work and Concepts
Designing a CF model involves two crucial steps:
learning user and item representations and modeling user-item interactions based on those representations (He et al., 2018). Before the advancements
provided by neural networks, matrix factorization
(MF) was the dominant model, representing users
and items as vectors of latent factors (called embeddings) and modeling user-item interactions with the
inner product operation. This operation leads
to poor performance because it is sub-optimal for
learning rich yet complicated patterns from real-world data (He et al., 2018). To address this limitation, neural networks (NN) have been integrated
into recommender architectures. One of the initial
works that laid the foundation for employing
NN for CF is NCF (He et al., 2017). Their framework, originally implemented for rating-based, implicit CF, learns non-linear interactions between
users and items by employing MLP layers as their
interaction function, granting it a high degree of
non-linearity and flexibility to learn meaningful
interactions. Two common designs have emerged
when it comes to leveraging MLP layers: placing
an MLP above either the concatenated user-item
embeddings (He et al., 2017; Bai et al., 2017) or the
element-wise product of user and item embeddings
(Zhang et al., 2017; Wang et al., 2017).
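The contrast between MF's inner product and the two MLP-input designs can be made concrete with toy embeddings (the vectors below are invented; real models would learn them):

```python
import numpy as np

u = np.array([0.2, 0.8, 0.5])   # toy user embedding
v = np.array([0.6, 0.1, 0.9])   # toy item embedding

# Matrix factorization scores the pair with a plain inner product:
mf_score = float(u @ v)

# Neural CF instead builds an MLP input from the embeddings,
# via either of the two common designs:
concat_input  = np.concatenate([u, v])   # MLP over concatenation
product_input = u * v                    # MLP over element-wise product

print(round(mf_score, 2))       # → 0.65
print(concat_input.shape)       # → (6,)
print(product_input)            # → [0.12 0.08 0.45]
```

The inner product is a fixed, linear combination of the latent factors, whereas feeding either constructed input to an MLP lets the model learn an arbitrary non-linear interaction function.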
As far as rating prediction is concerned, two
notable recommender models have achieved significant state-of-the-art prediction performance.
DeepCoNN is the first deep model that represents
users and items from reviews jointly (Zheng et al.,
2017). It consists of two parallel, CNN-powered
networks. One network learns user behavior by
examining all reviews that he has written, and the
other network models item properties by exploring all reviews that it has received. A shared layer
connects these two networks, and factorization machines capture user-item interactions. The second
model is NARRE, which shares certain similarities with DeepCoNN. NARRE is also composed of
two parallel networks for user and item modeling
with respective CNNs to process reviews (Chen
et al., 2018). Rather than concatenating reviews into
one long sequence as DeepCoNN
does, NARRE introduces an attention mechanism that learns review-level usefulness in the form
of attention weights. These weights are integrated
into user and item representations to enhance the
embedding quality and the subsequent prediction
accuracy. Both DeepCoNN and NARRE employ
traditional word embeddings.
Other relevant studies have claimed to provide
explanations for recommendations such as EFM
(Zhang et al., 2014), sCVR (Ren et al., 2017), and
TriRank (He et al., 2015). These models initially
extract aspects and opinions by performing phrase-level sentiment analysis on reviews. Afterward,
they generate feature-level explanations according
to product features that correspond to user interests
(Chen et al., 2018). However, these models have
some limitations: manual preprocessing is required
for sentiment analysis and feature extraction, and
the explanations are simple extractions of words or
phrases from the review text (Zhang et al., 2014;
Ren et al., 2017). This also has the unintended
effect of distorting the reviews’ original meaning
(Ribeiro et al., 2016; Chen et al., 2018). Another
limitation is that textual similarity is solely based
on lexical similarity; this implies that semantic
meaning is ignored (Zheng et al., 2017; Chen et al.,
2018).