
Note | Robust detection of link communities in large social networks by exploiting link semantics - PART 1

Paper

This is the first paper I have read closely, and I am using this blog to record and organize my thoughts as I study it.

(Slide 1.1)

I will introduce the overall content of the paper from four aspects: the background and existing methods, the model and method, the experiments, and the conclusions and discussion.

Background

(Slide 1.2)

The development of social networks

  • Networking is becoming more and more important for people in all fields around the world. As social networks develop, more and more information gathers on the Internet.

  • Analyzing this large volume of data lets us understand the deep structure of the network, user behavior, and future trends.

  • One of the most important problems in social networks is community discovery, which enables personalized recommendations and the identification of abnormal behavior.

  • So-called “community discovery” is the division of the user nodes of a social network into different groups, where the user nodes of each group share certain characteristics.

Existing methods

(Slide 1.3)

  • We usually represent a social network as a graph, where vertices represent users and edges represent connections between them.

  • Early community-discovery algorithms were based on the topology of the network alone: the partition should minimize the number of edges between communities while keeping the number of edges within each community as large as possible.

  • Later algorithms improved on this by also using node content, i.e., the node content within the same community should be as similar as possible. Exploiting node content can greatly improve community discovery.

  • We also found that the links between users (the edges of the graph) contain a lot of information.
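The topological criterion above can be made concrete with a minimal sketch. The edge list and partition below are made up for illustration; a good topological division has many intra-community edges and few inter-community ones:

```python
def partition_score(edges, community_of):
    """Count intra- and inter-community edges for a candidate partition."""
    intra = inter = 0
    for u, v in edges:
        if community_of[u] == community_of[v]:
            intra += 1
        else:
            inter += 1
    return intra, inter

# Toy network: five users, five connections (purely illustrative).
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
community_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
print(partition_score(edges, community_of))  # (4, 1)
```

Here only the edge ("c", "d") crosses community boundaries, so this partition scores well under the topological criterion.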

(Slide 1.4)

This figure illustrates how our approach differs from others: the graph on the right shows community discovery based on node content, and the graph on the left shows community discovery based on link content.

We can see the problems with the existing methods:

  1. Only node content is considered. Taking node content into account can sometimes work well. In the community discovery of Weibo users, for example, when the available content is user profiles, discovering communities from node content is perfectly feasible. However, when the available content consists of messages sent between users, this is really “link content”, and existing methods must convert it into node content; for example, all messages sent by user A are counted as node content of user A. This inevitably leads to inaccurate community divisions.

  2. They assume that the topological communities and the node-content communities contain the same user nodes. But two closely connected users may form a topological community while their chat content is very diverse, so they would fall into different node-content communities. In such cases the accuracy of existing methods degrades.

  3. Each community is limited to a single topic. For example, the figure on the right mixes Music and Movies together as one topic, while our method (on the left) keeps them as two topics.

  4. Only single words are used for community tagging, and a single word can be ambiguous. Our method instead labels communities with sentences, which are easy to understand.
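Problem 1 can be made concrete with a small sketch. Assuming made-up users and messages, converting link content into node content merges every conversation a user takes part in into a single profile, discarding the per-link topic information:

```python
from collections import defaultdict

# Illustrative messages attached to links (edges); users and text are made up.
link_messages = {
    ("alice", "bob"):   ["great movie last night"],
    ("alice", "carol"): ["new jazz album is out"],
}

# Converting link content to node content: every message on a user's links
# is merged into that user's "profile".
node_content = defaultdict(list)
for (u, v), msgs in link_messages.items():
    node_content[u].extend(msgs)
    node_content[v].extend(msgs)

# Alice's node content now mixes the movie and the music conversations, so a
# node-content method can no longer tell her two link communities apart.
print(node_content["alice"])  # ['great movie last night', 'new jazz album is out']
```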

(Slide 1.5)

The Model and Method

Overview

(Slide 1.6)

Detailed Analysis

Let’s start by looking at some of the factors we need to consider in community discovery:

  • Topology angle: nodes, links

  • Content angle: words, sentences, topics

  • Community and topic clusters
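The factors above can be collected into a simple container. This is only a hypothetical sketch of the model's inputs; the field names are mine, not the paper's notation:

```python
from dataclasses import dataclass

@dataclass
class LinkNetwork:
    nodes: list          # topology: user nodes
    links: list          # topology: (u, v) pairs
    link_text: dict      # content: words/sentences attached to each link
    n_communities: int   # number of community clusters to find
    n_topics: int        # number of topic clusters to find

# A toy instance with made-up users and messages.
net = LinkNetwork(
    nodes=["a", "b", "c"],
    links=[("a", "b"), ("b", "c")],
    link_text={("a", "b"): "let's watch a film", ("b", "c"): "new song drop"},
    n_communities=2,
    n_topics=2,
)
print(len(net.links))  # 2
```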

Variable

(Slides 1.7–1.10)

The relationships between the variables are shown below.

(Slides 1.11–1.12)

The left part of the figure is community discovery based on topology, and the right part is community discovery based on node content.

Now that our model is set up, our goals are as follows:

(Slide 1.13)

Algorithm

The overall idea of our algorithm is as follows: first, we divide all the links in the network into different communities according to the model (E-step), and then we extract the keywords of each community to annotate it (M-step).

We iterate these two steps over and over: the annotations guide a more accurate division of the communities, and the refined division in turn improves the annotations.

In the following, we apply the idea of maximum likelihood to derive the EM algorithm.

E-step:

(Slides 1.14–1.16)

The latent variable we take the expectation over represents the community to which each link is assigned.

We then apply Jensen's inequality to obtain a lower bound on the log-likelihood; the E-step makes this bound tight by setting the distribution over community assignments to their posterior.
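Since the exact formulas appear on slides not reproduced here, the following is the standard Jensen's-inequality bound used in EM, written in generic notation (W for the observed link data, z for the latent community assignments, Θ for the model parameters, q for an arbitrary distribution over z):

```latex
\log p(W \mid \Theta)
  = \log \sum_{z} q(z)\, \frac{p(W, z \mid \Theta)}{q(z)}
  \;\ge\; \sum_{z} q(z) \log \frac{p(W, z \mid \Theta)}{q(z)}
```

Equality holds when q(z) = p(z | W, Θ), i.e., the posterior over community assignments computed in the E-step; the M-step then maximizes this lower bound with respect to Θ.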

M-step:

(Slides 1.17–1.19)

Pseudo code is shown below.

(Slide 1.20)
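Since the pseudocode itself is on the slide, here is only a simplified, hypothetical sketch of the EM loop in the same spirit: each link carries a bag of words, the E-step computes the posterior (responsibility) of each community for each link, and the M-step re-estimates the mixture weights and per-community word distributions. This is a plain mixture-of-unigrams model, not the authors' full method; all names and data are illustrative.

```python
import random

def em_link_communities(link_docs, k, n_iters=30, seed=0):
    """link_docs: one bag of words per link. Returns a community per link."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in link_docs for w in doc})
    n = len(link_docs)
    # Responsibilities r[d][c]: posterior that link d belongs to community c,
    # initialized randomly to break symmetry.
    r = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(k)]
        s = sum(row)
        r.append([x / s for x in row])
    for _ in range(n_iters):
        # M-step: re-estimate mixture weights pi and word distributions phi
        # (with a small smoothing constant so probabilities stay nonzero).
        pi = [sum(r[d][c] for d in range(n)) / n for c in range(k)]
        phi = []
        for c in range(k):
            counts = {w: 0.01 for w in vocab}
            for d, doc in enumerate(link_docs):
                for w in doc:
                    counts[w] += r[d][c]
            total = sum(counts.values())
            phi.append({w: counts[w] / total for w in vocab})
        # E-step: recompute the posterior over communities for every link.
        for d, doc in enumerate(link_docs):
            scores = []
            for c in range(k):
                p = pi[c]
                for w in doc:
                    p *= phi[c][w]
                scores.append(p)
            s = sum(scores)
            r[d] = [x / s for x in scores]
    # Hard assignment: most probable community for each link.
    return [max(range(k), key=lambda c: r[d][c]) for d in range(n)]

# Toy link documents: two movie-themed links and two music-themed links.
link_docs = [["movie", "film"], ["film", "cinema"],
             ["guitar", "song"], ["song", "band"]]
labels = em_link_communities(link_docs, k=2)
print(labels)
```

The shared words ("film" in the first pair, "song" in the second) are what lets EM pull the two groups of links apart; the paper's model additionally couples this with the network topology.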