The conventional way to evaluate the performance of machine learning models for intrusion detection systems (IDS) is to train and test on the same dataset. This method may introduce bias from the computer network in which the traffic was generated, so the applicability of the learned models might not be adequately evaluated. We argued in Al-Riyami et al. (ACM, pp 2195-2197) that a better approach is cross-dataset evaluation, in which two different datasets, generated from different networks, are used for training and testing. As shown in Al-Riyami et al. (ACM, pp 2195-2197), this method may lead to a significant drop in the performance of the learned model, indicating that the models learn very little knowledge about intrusions that is transferable from one setting to another. The reasons for this behaviour were not fully understood in Al-Riyami et al. (ACM, pp 2195-2197). In this paper, we investigate the problem and show that the main cause is differing definitions of the same feature in the two datasets. We propose a correction and further empirically investigate cross-dataset evaluation for various machine learning methods. Finally, we explore cross-dataset evaluation in the multiclass classification of attacks and show that, for most models, learning traffic normality is more robust than learning intrusions.