Data preparation
For DeepHBV model training and testing, the 1000 bp DNA sequences immediately upstream and downstream of each HBV integration site were extracted and joined to form the positive dataset. Each sample was denoted as \(S=({n}_{1},{n}_{2},\dots ,{n}_{2000})\), where \({n}_{i}\) represents the nucleotide at position i. DNA sequences that do not contain HBV integration sites served as negative samples. Because HBV integration hot spots can contain several integration events within a 30–100,000 bp range [34], the background regions were selected at a sufficient distance from known HBV integration sites: the regions within 50,000 bp of known HBV integration sites in the hg38 reference genome were excluded, and 2000 bp sequences containing no HBV integration sites were randomly selected from the remaining regions as negative samples.
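As a minimal illustration of this negative-sampling scheme (the helper names and coordinates below are hypothetical, not the authors' released code), a candidate window can simply be rejected whenever it falls within 50,000 bp of a known integration site:

```python
import numpy as np

EXCLUSION = 50_000   # bp ignored on each side of a known site
WINDOW = 2_000       # length of each negative sample

def sample_negative(chrom_length, known_sites, rng):
    """Return the start of a 2000 bp window far from all known sites."""
    while True:
        start = rng.integers(0, chrom_length - WINDOW)
        end = start + WINDOW
        # keep the window only if no known site lies within 50 kb of it
        if all(site + EXCLUSION <= start or site - EXCLUSION >= end
               for site in known_sites):
            return start

rng = np.random.default_rng(0)
start = sample_negative(chrom_length=100_000_000,
                        known_sites=[12_345_678, 45_000_000], rng=rng)
print(start, start + WINDOW)
```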
The extracted DNA sequences were one-hot encoded so that feature similarities and distances could be computed more accurately during training. Each original DNA sequence was converted into a binary matrix with four channels, one per nucleotide type.
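The encoding can be sketched as follows; the row order A, C, G, T and the all-zero treatment of ambiguous bases are assumptions, as the text does not specify them:

```python
import numpy as np

# Each 2000 bp sequence becomes a 4 x 2000 binary matrix,
# one row per nucleotide type (row order A, C, G, T is assumed).
NUC_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(sequence):
    matrix = np.zeros((4, len(sequence)), dtype=np.float32)
    for i, nucleotide in enumerate(sequence.upper()):
        if nucleotide in NUC_INDEX:   # leave ambiguous bases (e.g. N) all-zero
            matrix[NUC_INDEX[nucleotide], i] = 1.0
    return matrix

E = one_hot("ACGTN")
print(E)
# [[1. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0.]
#  [0. 0. 1. 0. 0.]
#  [0. 0. 0. 1. 0.]]
```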
Feature extraction
The DeepHBV model first applied convolution and pooling modules to learn sequence features around HBV integration sites (Additional file 1: Fig. S1). Specifically, the model employed multiple distinct convolution kernels to extract different features. A DNA sequence, denoted as \(S = \left({n}_{1},{n}_{2},\dots ,{n}_{2000}\right)\), was encoded into a binary matrix E. Each binary matrix was entered into the convolution and pooling module for the convolution calculation \(X=conv(E)\), which can be written as:
$$X_{{k,i}} = \mathop \sum \limits_{{j = 0}}^{{p - 1}} \mathop \sum \limits_{{l = 1}}^{L} W_{{k,j,l}} E_{{l,i + j}}$$
(1)
Here, 1 ≤ k ≤ d, where d refers to the number of convolution kernels; 1 ≤ i ≤ n − p + 1, where \(i\) refers to the position index; p refers to the convolution kernel size; n refers to the input sequence length; L refers to the number of one-hot channels (four, one per nucleotide type); and \(W\) refers to the convolution kernel weights.
After extracting the relative eigenvectors and mapping each element onto a sparse matrix, the convolutional layer activated the eigenvectors using a rectified linear unit (ReLU). Next, the model applied a max-pooling strategy to reduce the dimensionality while retaining the most predictive information. The final eigenvector \({F}_{\mathrm{c}}\) was then extracted.
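A minimal numpy sketch of this convolution-and-pooling step, implementing Eq. (1) followed by ReLU and max-pooling (the kernel count, kernel size, and pooling width below are illustrative, not the tuned values in Additional file 4: Table S1):

```python
import numpy as np

def conv_relu_maxpool(E, W, pool=4):
    """E: L x n one-hot matrix; W: d x p x L kernel weights."""
    d, p, L = W.shape
    n = E.shape[1]
    X = np.zeros((d, n - p + 1))
    for k in range(d):                       # one feature map per kernel
        for i in range(n - p + 1):
            # Eq. (1): X[k, i] = sum_j sum_l W[k, j, l] * E[l, i + j]
            X[k, i] = np.sum(W[k] * E[:, i:i + p].T)
    X = np.maximum(X, 0.0)                   # ReLU activation
    m = X.shape[1] // pool                   # non-overlapping max-pooling
    return X[:, :m * pool].reshape(d, m, pool).max(axis=2)

E = np.random.default_rng(1).random((4, 100))             # toy 100 bp input
W = np.random.default_rng(2).standard_normal((8, 10, 4))  # d=8, p=10, L=4
F_c = conv_relu_maxpool(E, W)
print(F_c.shape)   # (8, 22)
```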
The attention mechanism in the DeepHBV model
The attention mechanism was applied in DeepHBV to determine the contribution of each position to the extracted eigenvector \({F}_{\mathrm{c}}\). Each eigenvalue was assigned a weight in the attention layer, reflecting the contribution of the convolutional neural network (CNN) output at that position.
The output from the convolution-and-pooling module, eigenvector \({F}_{\mathrm{c}}\), is the input of the attention layer, and the output is the weight vector \(W\), which can be denoted as
$$W = att\left( {a_{1} ,a_{2} , \ldots ,a_{q} } \right)$$
(2)
Here, \(att()\) refers to the attention mechanism, \({a}_{i}\) is the eigenvector in the \({i}^{th}\) dimension in the eigenmatrix, and \(W\) refers to the dataset containing the contribution values of each position in the eigenmatrix extracted by the convolution-and-pooling module.
All contribution values were normalized to obtain a dense eigenvector, denoted as \({F}_{a}\):
$$F_{a} = \mathop \sum \limits_{{j = 1}}^{q} a_{j} v_{j}$$
(3)
where \({a}_{j}\) refers to the normalized contribution value, and \({v}_{j}\) refers to the eigenvector at position \(j\) of the input eigenmatrix. Each position corresponds to an eigenvector extracted by the convolution kernels.
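A minimal sketch of Eqs. (2) and (3), assuming a softmax normalization and a learned scoring vector standing in for \(att()\) (the text does not specify the exact form of the scoring function):

```python
import numpy as np

def attention(F, w_att):
    """F: q x d feature matrix (q positions, d kernels); w_att: d scores."""
    scores = F @ w_att                    # one contribution score per position
    a = np.exp(scores - scores.max())     # softmax normalization ...
    a = a / a.sum()                       # ... so the weights sum to 1
    F_a = (a[:, None] * F).sum(axis=0)    # Eq. (3): weighted sum of eigenvectors
    return F_a, a

rng = np.random.default_rng(3)
F = rng.random((22, 8))                   # e.g. the pooled output from above
F_a, weights = attention(F, rng.standard_normal(8))
print(F_a.shape, weights.sum())           # (8,) ~1.0
```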
For prediction, the convolution-and-pooling module and the attention mechanism module were combined; that is, the eigenvector \({F}_{c}\) and the attention output \({F}_{a}\) work together in predicting HBV integration sites.
The values in the eigenvector \({F}_{c}\) were linearly mapped to a new vector, \({F}_{v}\), which is
$$F_{v} = dense\left( {flatten\left( {F_{c} } \right)} \right)$$
(4)
In this step, the flatten layer applied the function \(flatten()\) to reduce the dimension and concatenate the data, and the dense layer applied the function \(dense()\) to map the dimension-reduced data to a single value. The concatenation of \({F}_{v}\) and \({F}_{a}\) then entered the linear prediction classifier to calculate the probability that HBV integration occurs within the current sequence, as follows:
$$P = sigmoid\left( {concat\left( {F_{a} ,F_{v} } \right)} \right)$$
(5)
where \(P\) is the predicted value, \(sigmoid()\) refers to the activation function acting as the classifier in the final output, and \(concat()\) refers to the concatenation operation.
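The prediction head of Eqs. (4) and (5) can be sketched as follows; since the text describes a linear prediction classifier on the concatenated vector, a final linear layer is included before the sigmoid, and all weights are random placeholders for the trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
F_c = rng.random((8, 22))                    # conv-and-pooling output
F_a = rng.random(8)                          # attention output

flat = F_c.ravel()                           # Eq. (4): flatten()
W_dense = rng.standard_normal(flat.size)
F_v = np.array([flat @ W_dense])             # dense(): map to a single value

features = np.concatenate([F_a, F_v])        # Eq. (5): concat(F_a, F_v)
W_out = rng.standard_normal(features.size)
P = sigmoid(features @ W_out)                # predicted integration probability
print(P)
```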
In parallel, the output eigenvector \({F}_{c}\) from the convolution-and-pooling module also serves as the input to the attention mechanism, whose weight vector \(W\) can be described as:
$$W = att\left( {a_{1} ,a_{2} , \ldots ,a_{q} } \right)$$
(6)
Here, \(W\), \(att()\), and \({a}_{i}\) are as defined for Eq. (2).
DeepHBV model evaluation
After the parameters of DeepHBV were determined (Additional file 4: Table S1), the DeepHBV deep learning neural network model was trained using binary cross-entropy. The loss function of DeepHBV is defined as:
$$Loss = - \mathop \sum \limits_{i} \left[ {y_{i} \log \left( {P_{i} } \right) + \left( {1 - y_{i} } \right)\log \left( {1 - P_{i} } \right)} \right]$$
(7)
where \({y}_{i}\) is the binary label of the ith sequence (in this dataset, positive samples were labeled as 1, and negative samples were labeled as 0), and \({P}_{i}\) is the predicted value for that sequence.
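A direct numpy rendering of Eq. (7); the clipping epsilon is an implementation detail added here for numerical stability, not stated in the text:

```python
import numpy as np

def bce_loss(y, p, eps=1e-7):
    """Binary cross-entropy as in Eq. (7), summed over samples."""
    p = np.clip(p, eps, 1.0 - eps)   # avoid log(0)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = np.array([1.0, 0.0, 1.0])        # binary labels
p = np.array([0.9, 0.2, 0.7])        # predicted probabilities
print(bce_loss(y, p))                # ~0.685
```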
To evaluate the DeepHBV model, tenfold cross-validation was adopted. Performance was assessed using the confusion matrix (true positives, true negatives, false positives, and false negatives), together with accuracy, specificity, sensitivity, AUROC, AUPR, the Matthews correlation coefficient, and the F1 score.
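These metrics can be computed with the scikit-learn version cited below; the labels and predictions in this sketch are illustrative:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             confusion_matrix, f1_score, matthews_corrcoef,
                             roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])          # binary labels
p_pred = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])  # sigmoid outputs
y_pred = (p_pred >= 0.5).astype(int)                 # thresholded predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)
sensitivity = tp / (tp + fn)

print("accuracy   ", accuracy_score(y_true, y_pred))
print("specificity", specificity)
print("sensitivity", sensitivity)
print("AUROC      ", roc_auc_score(y_true, p_pred))
print("AUPR       ", average_precision_score(y_true, p_pred))
print("MCC        ", matthews_corrcoef(y_true, y_pred))
print("F1         ", f1_score(y_true, y_pred))
```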
The DeepHBV model was implemented with TensorFlow 1.13.1 and scikit-learn 0.24 [35] and run on an NVIDIA Tesla V100-PCIE-32G GPU (NVIDIA Corporation, California, USA).