datamining-questions - CA Essay Writers

1. What are the differences among the three:

(1) boxplot (2) scatter plot (3) Q-Q plot?
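As a small illustration of what each of the three plots actually draws, the following sketch computes the underlying numbers without any plotting library. The function names and the sample data are hypothetical, and the Q-Q helper assumes two samples of equal size so that sorted values can be paired directly.

```python
import statistics

def five_number_summary(xs):
    """Values a boxplot draws for ONE attribute: min, Q1, median, Q3, max."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return min(xs), q1, median, q3, max(xs)

def scatter_points(xs, ys):
    """Scatter plot: one point per object, pairing its two attribute values."""
    return list(zip(xs, ys))

def qq_points(xs, ys):
    """Q-Q plot (equal-sized samples): pair the i-th smallest x with the i-th smallest y."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return list(zip(sorted(xs), sorted(ys)))

# Hypothetical data: two attributes measured on five objects.
xs = [3, 1, 2, 5, 4]
ys = [30, 50, 20, 10, 40]

print(five_number_summary(xs))
print(scatter_points(xs, ys))
print(qq_points(xs, ys))
```

The scatter points keep each object's two values together, whereas the Q-Q points discard the pairing and compare the two distributions quantile by quantile.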

2. Assume a base cuboid of 20 dimensions contains only two base cells:

(1) $(a_1, a_2, a_3, b_4, \ldots, b_{19}, b_{20})$, (2) $(b_1, b_2, b_3, \ldots, b_{19}, b_{20})$,

where $a_i \neq b_i$ for any $i$. The measure of the cube is count.

(a) How many nonempty aggregated cells will a complete cube contain?

(b) How many nonempty aggregated cells will an iceberg cube contain if the condition of the iceberg cube is count $\geq 2$?
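To make the definitions concrete, the following brute-force sketch counts aggregated cells for a scaled-down instance: 5 dimensions, with two base cells that differ in the first 3 dimensions and agree on the rest. The function name, the `'*'` marker, and the small instance are assumptions made for illustration; enumeration grows as 2^n per base cell, so this is a way to check reasoning on small cases rather than an efficient method.

```python
from itertools import product

def count_cells(base_cells, min_count=1):
    """Count distinct nonempty aggregated cells whose count is >= min_count.

    Assumes the base cells are distinct, so each contributes a count of 1.
    """
    n = len(base_cells[0])
    counts = {}
    for cell in base_cells:
        # Every generalization replaces some subset of dimensions with '*'.
        for mask in product([False, True], repeat=n):
            if not any(mask):
                continue  # no '*' at all: this is the base cell, not an aggregate
            agg = tuple('*' if starred else v for v, starred in zip(cell, mask))
            counts[agg] = counts.get(agg, 0) + 1
    return sum(1 for c in counts.values() if c >= min_count)

# Scaled-down instance: the cells differ in dimensions 1-3 and agree on 4-5.
cell_1 = ('a1', 'a2', 'a3', 'b4', 'b5')
cell_2 = ('b1', 'b2', 'b3', 'b4', 'b5')

print(count_cells([cell_1, cell_2]))               # complete cube
print(count_cells([cell_1, cell_2], min_count=2))  # iceberg cube, count >= 2
```

In this small case an aggregated cell reaches count 2 only when every dimension on which the two base cells differ has been replaced by `'*'`.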

3. Since items have different values and expected frequencies of sale, it is desirable to use group-based minimum support thresholds set up by users. For example, one may set up a small minimum support for the group of diamonds but a rather large one for the group of shoes. Outline an Apriori-like algorithm that derives the set of frequent items efficiently in a transaction database.
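One possible shape for such an algorithm is sketched below; it is not the only reasonable answer. It rests on two labeled assumptions: an itemset's threshold is taken to be the smallest group threshold among its items, and candidates are pruned level by level using the weakest threshold over all groups, which keeps the downward-closure property intact. The names `group_apriori`, `item_group`, and `group_min_sup` are hypothetical.

```python
from itertools import combinations

def group_apriori(transactions, item_group, group_min_sup):
    """Return {itemset: support_count} for itemsets meeting their group-based threshold."""
    transactions = [frozenset(t) for t in transactions]
    floor = min(group_min_sup.values())  # weakest threshold, safe for pruning

    def threshold(itemset):
        # Assumption: an itemset inherits the smallest threshold among its items' groups.
        return min(group_min_sup[item_group[i]] for i in itemset)

    def count(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return counts

    items = {i for t in transactions for i in t}
    singles = count({frozenset([i]) for i in items})
    level = {c: n for c, n in singles.items() if n >= floor}
    result, k = {}, 1
    while level:
        # Report only itemsets that meet their own group-based threshold.
        result.update({c: n for c, n in level.items() if n >= threshold(c)})
        k += 1
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            # Apriori pruning: every (k-1)-subset must have survived the floor.
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                candidates.add(union)
        level = {c: n for c, n in count(candidates).items() if n >= floor}
    return result

# Hypothetical data: diamonds sell rarely, shoes and socks sell often.
transactions = [{"diamond", "shoe"}, {"shoe", "sock"}, {"shoe", "sock"}, {"shoe"}]
item_group = {"diamond": "jewelry", "shoe": "apparel", "sock": "apparel"}
group_min_sup = {"jewelry": 1, "apparel": 3}  # absolute support counts

frequent = group_apriori(transactions, item_group, group_min_sup)
print(frequent)
```

Pruning with the weakest threshold is deliberately conservative: it may keep more candidates than necessary, but it never discards a subset that a qualifying larger itemset depends on.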

4. For mining correlated patterns in a transaction database, all confidence ($\alpha$) has been used as an interestingness measure. A set of items $\{A_1, A_2, \ldots, A_k\}$ is strongly correlated if

$$\frac{\mathrm{sup}(A_1, A_2, \ldots, A_k)}{\max(\mathrm{sup}(A_1), \ldots, \mathrm{sup}(A_k))} \geq min_\alpha$$

where $min_\alpha$ is the minimal all confidence threshold and $\max(\mathrm{sup}(A_1), \ldots, \mathrm{sup}(A_k))$ is the maximal support among all the single items.

Based on the equation above, prove that if the current k-itemset cannot satisfy the constraint, its corresponding (k+1)-itemset cannot satisfy it either.
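Before writing the proof, it can help to check the claim numerically. The sketch below computes all confidence on a small hypothetical transaction set and verifies that adding an item to an itemset never increases the measure. The function names and the data are assumptions; a passing check illustrates the property on one data set and is not a substitute for the proof.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def all_confidence(itemset, transactions):
    """sup(itemset) divided by the largest single-item support within it."""
    max_single = max(support(frozenset([i]), transactions) for i in itemset)
    return support(itemset, transactions) / max_single

def never_increases(transactions):
    """True if extending any itemset by one item never raises all confidence."""
    items = sorted(set().union(*transactions))
    for k in range(1, len(items)):
        for combo in combinations(items, k):
            x = frozenset(combo)
            for extra in items:
                if extra in x:
                    continue
                bigger = x | {extra}
                if all_confidence(bigger, transactions) > all_confidence(x, transactions):
                    return False
    return True

# Hypothetical transaction database.
transactions = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]

print(all_confidence(frozenset("ab"), transactions))
print(all_confidence(frozenset("abc"), transactions))
print(never_increases(transactions))
```

The check reflects the two facts the proof relies on: the support in the numerator cannot grow when an item is added, and the maximum single-item support in the denominator cannot shrink.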

5. What are the major differences among the three:

(1) information gain (2) gain ratio (3) FOIL gain?
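For reference while comparing the three, the sketch below implements the commonly stated textbook formulas: information gain as the entropy reduction of a split, gain ratio as information gain normalized by the split information, and FOIL gain as used when extending a rule in sequential covering. The function names and the toy data are hypothetical, and the zero-handling conventions are choices made for this sketch.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Entropy of the class labels minus the expected entropy after splitting on values."""
    n = len(labels)
    parts = {}
    for v, y in zip(values, labels):
        parts.setdefault(v, []).append(y)
    remainder = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain divided by split information (entropy of the split itself)."""
    split_info = entropy(values)
    return info_gain(values, labels) / split_info if split_info else 0.0

def foil_gain(pos, neg, pos_new, neg_new):
    """Gain from extending a rule: pos/neg covered before, pos_new/neg_new after."""
    if pos_new == 0:
        return 0.0  # convention for this sketch: no positives covered, no gain
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))

labels = ["y", "y", "n", "n"]
binary_attr = ["a", "a", "b", "b"]  # splits the classes perfectly into 2 branches
id_attr = [1, 2, 3, 4]              # unique per tuple, also "perfect" but useless

print(info_gain(binary_attr, labels), gain_ratio(binary_attr, labels))
print(info_gain(id_attr, labels), gain_ratio(id_attr, labels))
print(foil_gain(pos=2, neg=2, pos_new=2, neg_new=0))
```

On this toy data both attributes reach the same information gain, while gain ratio penalizes the many-valued identifier attribute; FOIL gain is evaluated on a rule's positive and negative coverage rather than on a multiway split.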

6. What are the major differences between:

(1) bagging (2) boosting?
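The contrast in how each method prepares the training data for the next classifier can be sketched in a few lines. Bagging is shown as uniform bootstrap sampling, and boosting as one AdaBoost-style reweighting round in which correctly classified tuples are scaled by error / (1 - error) and the weights are then renormalized. The function names are hypothetical, and other boosting variants use different update rules.

```python
import random

def bootstrap_sample(data, rng):
    """Bagging: draw len(data) tuples uniformly with replacement, independent of past rounds."""
    return [rng.choice(data) for _ in data]

def boosting_reweight(weights, correct):
    """Boosting (AdaBoost-style): shrink weights of correctly classified tuples, renormalize."""
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0 < error < 0.5:
        raise ValueError("this sketch assumes a weighted error strictly between 0 and 0.5")
    factor = error / (1 - error)
    updated = [w * factor if ok else w for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated]

rng = random.Random(0)
data = ["t1", "t2", "t3", "t4"]
sample = bootstrap_sample(data, rng)

# The current classifier got the first three tuples right and the last one wrong.
new_weights = boosting_reweight([0.25, 0.25, 0.25, 0.25], [True, True, True, False])
print(sample)
print(new_weights)
```

Each bootstrap sample ignores how earlier classifiers performed, while the reweighting step makes the next round depend on the previous classifier's mistakes.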

7. Given a 50 GB data set with 40 attributes, each containing 100 distinct values, and 512 MB of main memory in a laptop, outline an efficient method for constructing decision trees, and answer the following questions explicitly:

(a) How many scans of the database does your algorithm take if the maximal depth of the decision tree derived is 5?

(b) How do you use your memory space in your tree induction?
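One approach that fits this setting is to keep only per-node count tables in memory, in the style of RainForest AVC-sets, and to stream the data from disk once per tree level. The sketch below is an assumption about how an answer might be organized, not a prescribed solution: `build_avc_groups`, `route`, and `avc_group_bytes` are hypothetical names, and the number of classes and the bytes per counter are not given in the question.

```python
from collections import defaultdict

def build_avc_groups(rows, route):
    """One sequential scan: accumulate (attribute, value, class) counts per open tree node.

    rows  : iterable of (attribute_tuple, class_label); can be a generator reading from disk
    route : function mapping an attribute tuple to the node it currently falls into,
            or None if that tuple already sits in a finished leaf
    """
    avc = defaultdict(lambda: defaultdict(int))
    for attrs, label in rows:
        node = route(attrs)
        if node is None:
            continue
        for attr_index, value in enumerate(attrs):
            avc[node][(attr_index, value, label)] += 1
    return avc

def avc_group_bytes(n_attrs, n_values, n_classes, bytes_per_count=4):
    """Upper bound on the counter storage for one node's AVC-group."""
    return n_attrs * n_values * n_classes * bytes_per_count

# Tiny hypothetical data set with two attributes; every tuple is routed to the root.
rows = [((0, 1), "y"), ((0, 0), "n"), ((1, 1), "y")]
avc = build_avc_groups(rows, route=lambda attrs: "root")

print(dict(avc["root"]))
print(avc_group_bytes(n_attrs=40, n_values=100, n_classes=2))  # assumes 2 classes
```

Because the counters depend on the number of attributes, distinct values, and classes rather than on the number of tuples, the tables for all open nodes of a level can be filled in the same pass as long as they fit in memory together.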