
## Collaborative Filtering-Based Recommendation Methods

shiyanjuncn 2016-04-13 15:00:58

• User-based collaborative filtering
• Item-based collaborative filtering
• Model-based collaborative filtering
• Hybrid collaborative filtering

• EuclideanDistanceSimilarity

• PearsonCorrelationSimilarity

• UncenteredCosineSimilarity

• LogLikelihoodSimilarity

Mahout computes this similarity with the log-likelihood ratio (LLR). For two events A and B, we count how often each occurs, which gives the following event matrix:

| | Event A | Everything but A |
|---|---|---|
| Event B | k11 | k12 |
| Everything but B | k21 | k22 |

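k11: the number of times event A and event B occur together
k12: the number of times event B occurs without event A
k21: the number of times event A occurs without event B
k22: the number of times neither event A nor event B occurs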

```java
public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
  Preconditions.checkArgument(k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0);
  // note that we have counts here, not probabilities, and that the entropy is not normalized.
  double rowEntropy = entropy(k11 + k12, k21 + k22);
  double columnEntropy = entropy(k11 + k21, k12 + k22);
  double matrixEntropy = entropy(k11, k12, k21, k22);
  if (rowEntropy + columnEntropy < matrixEntropy) {
    // round off error
    return 0.0;
  }
  return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
}
```
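The method above calls an `entropy` helper that is not shown. The following is a minimal self-contained sketch, assuming `entropy` is the unnormalized Shannon entropy of the counts. The class name `LlrSketch` and the helper `xLogX` are illustrative names chosen here, and the argument check uses a plain exception instead of Guava's `Preconditions`.

```java
public class LlrSketch {
    // x * ln(x), with the convention 0 * ln(0) = 0.
    public static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy of a set of counts:
    // N * ln(N) - sum(k * ln(k)), where N is the sum of all counts.
    public static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long count : counts) {
            result += xLogX(count);
            sum += count;
        }
        return xLogX(sum) - result;
    }

    public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        if (k11 < 0 || k12 < 0 || k21 < 0 || k22 < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        if (rowEntropy + columnEntropy < matrixEntropy) {
            return 0.0; // guard against round-off error
        }
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

    public static void main(String[] args) {
        // A and B independent: the four cells are balanced.
        System.out.println(logLikelihoodRatio(10, 10, 10, 10));
        // A and B always occur together or not at all.
        System.out.println(logLikelihoodRatio(100, 0, 0, 100));
    }
}
```

With balanced counts the ratio is (up to round-off) zero, while counts concentrated on k11 and k22 give a large value, which is why a higher LLR indicates a stronger association between the two events.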

• TanimotoCoefficientSimilarity

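The Tanimoto coefficient derives from the Jaccard distance. The Jaccard distance expresses how dissimilar two data sets are, that is, it is a dissimilarity measure; the Tanimoto coefficient is its complement and measures similarity.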
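As a rough illustration of that relationship, the sketch below computes the coefficient for two sets of item IDs as the size of the intersection divided by the size of the union. It is plain Java rather than Mahout's `TanimotoCoefficientSimilarity`; the class name `TanimotoSketch` and the sample data are made up for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TanimotoSketch {
    // |A ∩ B| / |A ∪ B| for two sets of item IDs; defined as 0 when both sets are empty.
    public static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> intersection = new HashSet<Long>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return unionSize == 0 ? 0.0 : (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        Set<Long> user1 = new HashSet<Long>(Arrays.asList(1L, 2L, 3L));
        Set<Long> user2 = new HashSet<Long>(Arrays.asList(2L, 3L, 4L));
        double similarity = tanimoto(user1, user2);
        System.out.println("similarity = " + similarity);
        // The Jaccard distance is the complement of the coefficient.
        System.out.println("distance   = " + (1.0 - similarity));
    }
}
```

Here the two users share two of four distinct items, so the coefficient is 0.5 and the corresponding Jaccard distance is also 0.5.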

• CityBlockSimilarity

• SpearmanCorrelationSimilarity

The formula for the Spearman correlation coefficient is as follows:
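A commonly used form of the coefficient, assuming the rankings contain no ties, is:

$$
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n \,(n^{2} - 1)}
$$

Here $n$ is the number of items rated by both users and $d_i$ is the difference between the two users' ranks for item $i$. The symbols are chosen for this sketch; the coefficient is simply the Pearson correlation applied to ranks instead of raw preference values.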

```java
// Classes come from Mahout's Taste API (org.apache.mahout.cf.taste.*).
final DataModel model = new FileDataModel(new File("src/main/resources/u.data"));
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
int n = 20; // size of the user neighborhood
UserNeighborhood neighborhood = new NearestNUserNeighborhood(n, similarity, model);

final UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
long userID = 200;
int topN = 10;
List<RecommendedItem> recommendations = recommender.recommend(userID, topN);
for (RecommendedItem recommendation : recommendations) {
    System.out.println(recommendation.getItemID() + ": " + recommendation.getValue());
}
```

```
316: 5.0
895: 4.6441307
100: 4.6418004
315: 4.564026
896: 4.5467043
285: 4.5023627
344: 4.475635
346: 4.4689593
272: 4.4235177
750: 4.297509
```

```java
// Classes come from Mahout's Taste API (org.apache.mahout.cf.taste.*).
final DataModel model = new FileDataModel(new File("src/main/resources/u.data"));
ItemSimilarity similarity = new LogLikelihoodSimilarity(model);

final ItemBasedRecommender recommender = new GenericItemBasedRecommender(model, similarity);
long userID = 200;
int topN = 10;
List<RecommendedItem> recommendations = recommender.recommend(userID, topN);
for (RecommendedItem recommendation : recommendations) {
    System.out.println(recommendation.getItemID() + ": " + recommendation.getValue());
}
```

```
1431: 4.6409993
1156: 4.494837
1127: 4.4886928
1234: 4.481482
1294: 4.442889
1122: 4.442692
1654: 4.419746
1593: 4.419684
1595: 4.392633
1596: 4.392633
```

Mahout implements matrix factorization with the ALS-WR method. Its objective function is given by the following formula:
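The objective of ALS-WR (alternating least squares with weighted-λ regularization) is usually written in the following explicit-feedback form; the notation is an assumption made for this sketch:

$$
f(U, M) = \sum_{(i,j) \in I} \left( r_{ij} - \mathbf{u}_i^{T} \mathbf{m}_j \right)^{2}
+ \lambda \left( \sum_{i} n_{u_i} \lVert \mathbf{u}_i \rVert^{2} + \sum_{j} n_{m_j} \lVert \mathbf{m}_j \rVert^{2} \right)
$$

Here $I$ is the set of observed (user, item) ratings, $r_{ij}$ is the rating of item $j$ by user $i$, $\mathbf{u}_i$ and $\mathbf{m}_j$ are the feature vectors of user $i$ and item $j$ (the columns of $U$ and $M$), $n_{u_i}$ is the number of ratings given by user $i$, $n_{m_j}$ is the number of ratings received by item $j$, and $\lambda$ is the regularization weight. ALS minimizes $f$ by fixing $M$ and solving for $U$, then fixing $U$ and solving for $M$, and repeating.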

• Prepare the training set and the test set

```shell
bin/mahout splitDataset -i /test/shiyj/data/ml-20m/ratings.csv -o /test/shiyj/data/splitDS -t 0.8 -p 0.2 --tempDir /tmp/mahout/
```

• Train the recommendation model

```shell
bin/mahout parallelALS -i /test/shiyj/data/splitDS/trainingSet -o /test/shiyj/data/output/als --lambda 0.1 --implicitFeedback true --alpha 0.065 --numFeatures 10 --numIterations 10 --numThreadsPerSolver 2 --tempDir /tmp/mahout
```

```
/test/shiyj/data/output/als/U
/test/shiyj/data/output/als/M
/test/shiyj/data/output/als/userRatings
```

• Evaluate the model

```shell
bin/mahout evaluateFactorization -i /test/shiyj/data/splitDS/probeSet -o /test/shiyj/data/output/als/rmse --userFeatures /test/shiyj/data/output/als/U --itemFeatures /test/shiyj/data/output/als/M --tempDir /tmp/mahout/rmse
```
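The evaluation step measures the factorization against the probe set using root mean squared error (RMSE), which is what the `rmse` output directory refers to. As a reminder of what that metric is, here is a minimal plain-Java sketch; the class name `RmseSketch` and the sample arrays are illustrative and are not part of Mahout.

```java
public class RmseSketch {
    // Root mean squared error between observed and predicted ratings.
    public static double rmse(double[] actual, double[] predicted) {
        if (actual.length != predicted.length || actual.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumSquaredError = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sumSquaredError += diff * diff;
        }
        return Math.sqrt(sumSquaredError / actual.length);
    }

    public static void main(String[] args) {
        double[] actual = {4.0, 3.0, 5.0, 1.0};
        double[] predicted = {3.5, 3.0, 4.0, 2.0};
        System.out.println("RMSE = " + rmse(actual, predicted));
    }
}
```

A lower RMSE on the probe set means the learned user and item features reproduce the held-out ratings more closely.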

• Generate recommendations

```shell
bin/mahout recommendfactorized -i /test/shiyj/data/output/als/userRatings -o /test/shiyj/data/output/als/recommendations --userFeatures /test/shiyj/data/output/als/U --itemFeatures /test/shiyj/data/output/als/M --numRecommendations 10 --maxRating 5
```

```
1 [2571:0.65515196,1214:0.63967377,1210:0.5870833,1206:0.58017606,750:0.53064054,1270:0.52714694,1527:0.492824,5952:0.4770518,480:0.4517607,1580:0.43877804]
2 [1196:0.22062394,1210:0.21882197,1198:0.19199324,1240:0.1805339,2571:0.17224884,1200:0.16999668,1291:0.15613373,858:0.14950913,1097:0.14854312,1197:0.1480078]
3 [1196:0.8909414,1214:0.8081364,1270:0.6871423,924:0.64723164,858:0.6388401,1136:0.6263018,1291:0.58192265,1036:0.56856805,750:0.55387926,296:0.5497311]
4 [457:0.34563774,592:0.3424035,150:0.32874095,590:0.32829174,349:0.3071443,153:0.29904938,316:0.29723072,110:0.28530872,344:0.28501293,434:0.27692935]
5 [1:0.64543045,356:0.55862373,733:0.48568383,32:0.48033717,588:0.46519792,62:0.46215564,590:0.4575655,110:0.44682676,597:0.420567,95:0.41918993]
6 [648:0.45023045,736:0.4386352,95:0.34667525,786:0.3351175,1073:0.32370603,32:0.31021506,1210:0.30251333,608:0.2933381,7:0.28936782,36:0.28653657]
7 [1198:0.52251583,2028:0.51249886,1210:0.5006326,1197:0.48272145,1291:0.46804646,2571:0.45988122,1784:0.45771137,2797:0.4538604,1610:0.45373544,1704:0.4534099]
8 [480:0.86542463,318:0.7133561,434:0.6677318,185:0.6504521,208:0.6451925,161:0.64106977,253:0.6124166,34:0.5435114,586:0.5240867,288:0.50809836]
9 [2858:0.16401465,2762:0.12627697,2571:0.123330824,2997:0.11046047,296:0.10837068,2028:0.10639984,2329:0.103341825,1704:0.10107514,1089:0.100894466,50:0.098862514]
10 [1270:0.32084247,2571:0.2983431,608:0.29643345,1291:0.29354194,780:0.2851402,1036:0.27120706,1214:0.2681493,1197:0.26402688,1097:0.2634927,1200:0.24816594]
```

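1. Weighted hybridization: combine the results of several recommenders with a linear formula, each with its own weight. The weights have to be tuned through repeated experiments on a test data set to reach the best recommendation quality.
2. Switching hybridization: the best recommendation strategy can differ considerably from one situation to another (data volume, system load, number of users and items, and so on). A switching hybrid selects the most suitable recommender for the situation at hand.
3. Mixed hybridization: run several recommenders and present their results to the user in separate sections.
4. Meta-level hybridization: run several recommenders and feed the output of one into another, so that their strengths and weaknesses balance out and the final recommendations are more accurate.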
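Of the four strategies, the weighted one is the easiest to sketch in code. The example below combines the item scores of two recommenders as a linear sum. The class name `WeightedHybridSketch`, the weights, and the sample scores are all made up for illustration, and treating a missing item as a score of 0 is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

public class WeightedHybridSketch {
    // Combines two score maps (item ID -> score) as weightA * scoreA + weightB * scoreB.
    // An item that one recommender did not score contributes 0 from that side.
    public static Map<Long, Double> combine(Map<Long, Double> scoresA, double weightA,
                                            Map<Long, Double> scoresB, double weightB) {
        Map<Long, Double> combined = new HashMap<Long, Double>();
        for (Map.Entry<Long, Double> e : scoresA.entrySet()) {
            combined.merge(e.getKey(), weightA * e.getValue(), Double::sum);
        }
        for (Map.Entry<Long, Double> e : scoresB.entrySet()) {
            combined.merge(e.getKey(), weightB * e.getValue(), Double::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        Map<Long, Double> userBased = new HashMap<Long, Double>();
        userBased.put(316L, 5.0);
        userBased.put(895L, 4.6);
        Map<Long, Double> itemBased = new HashMap<Long, Double>();
        itemBased.put(895L, 4.2);
        itemBased.put(1431L, 4.6);
        // Hypothetical weights; in practice they are tuned on a test data set.
        System.out.println(combine(userBased, 0.7, itemBased, 0.3));
    }
}
```

The weights are the only tunable part, which is why the strategy depends on repeated experiments against held-out data.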

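• Compute a recommendation model offline from users' historical behavior data, giving model A;
• To meet real-time requirements, apply user-based or item-based collaborative filtering to real-time user behavior data, giving model B;
• For the user cold-start problem, use simple statistics to find popular items and recommend them to new users, giving model C;
• For existing users, sort and filter the combined results of models A and B to obtain a new Top N recommendation list.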
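The four steps above can be sketched as a small switching routine. Everything here is illustrative: the class name `HybridPipelineSketch`, the representation of models A and B as score maps and model C as a popularity-ordered list, and the rule of keeping the higher score when A and B both propose the same item are assumptions, not details from the text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HybridPipelineSketch {
    // modelA: offline scores, modelB: real-time scores (item ID -> score),
    // modelC: popular items, most popular first.
    public static List<Long> recommend(boolean isNewUser,
                                       Map<Long, Double> modelA,
                                       Map<Long, Double> modelB,
                                       List<Long> modelC,
                                       int topN) {
        if (isNewUser) {
            // Cold start: fall back to the popularity ranking.
            return new ArrayList<Long>(modelC.subList(0, Math.min(topN, modelC.size())));
        }
        // Merge A and B; if both score an item, keep the higher score (assumption).
        Map<Long, Double> merged = new HashMap<Long, Double>(modelA);
        for (Map.Entry<Long, Double> e : modelB.entrySet()) {
            merged.merge(e.getKey(), e.getValue(), Math::max);
        }
        // Sort by score, highest first, and keep the first topN items.
        List<Map.Entry<Long, Double>> ranked =
                new ArrayList<Map.Entry<Long, Double>>(merged.entrySet());
        ranked.sort(Map.Entry.<Long, Double>comparingByValue().reversed());
        List<Long> result = new ArrayList<Long>();
        for (Map.Entry<Long, Double> e : ranked) {
            if (result.size() >= topN) {
                break;
            }
            result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, Double> a = new HashMap<Long, Double>();
        a.put(10L, 0.9);
        a.put(11L, 0.5);
        Map<Long, Double> b = new HashMap<Long, Double>();
        b.put(11L, 0.7);
        b.put(12L, 0.8);
        List<Long> popular = new ArrayList<Long>();
        popular.add(1L);
        popular.add(2L);
        popular.add(3L);
        System.out.println(recommend(true, a, b, popular, 2));
        System.out.println(recommend(false, a, b, popular, 2));
    }
}
```

A new user receives the head of the popularity list, while an existing user receives the highest-scoring items from the union of the offline and real-time results.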
