datamining-questions
1. What are the differences among the three:
(1) boxplot (2) scatter plot (3) Q-Q plot?
2. Assume a base cuboid of 10 dimensions contains only two base cells:
(1) (a1, a2, a3, b4, ..., b19, b20), (2) (b1, b2, b3, ..., b19, b20),
where ai ≠ bi for any i. The measure of the cube is count.
(a) How many nonempty aggregated cells will a complete cube contain?
(b) How many nonempty aggregated cells will an iceberg cube contain
if the condition of the iceberg cube is "count ≥ 2"?
(c) How many closed cells are in the full cube?
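The three counts asked for in question 2 can be checked by brute force on a much smaller cube. The sketch below is only an illustration: the 3-dimensional toy cells, the `STAR` marker, and the helper names are assumptions made for this example and are not part of the question. It treats a cell as closed when no '*' coordinate can be specialized without lowering the count.

```python
from itertools import product

STAR = "*"  # marks an aggregated ("all") dimension; hypothetical marker


def generalizations(base_cell):
    """Yield every cell obtained by replacing any subset of coordinates with '*'."""
    return product(*[(value, STAR) for value in base_cell])


def cube_counts(base_cells):
    """Map each nonempty cell of the full cube to its count measure."""
    counts = {}
    for base in base_cells:
        for cell in generalizations(base):
            counts[cell] = counts.get(cell, 0) + 1
    return counts


def is_closed(cell, base_cells):
    """A cell is closed if every '*' dimension takes at least two distinct
    values among the base cells it covers, so that no specialization keeps
    the same count."""
    covered = [b for b in base_cells
               if all(c == STAR or c == v for c, v in zip(cell, b))]
    return not any(
        c == STAR and len({b[i] for b in covered}) == 1
        for i, c in enumerate(cell)
    )


# Toy data (assumed): two base cells that differ in the first two
# dimensions and agree on the last one.
toy_base = [("a1", "a2", "b3"), ("b1", "b2", "b3")]
counts = cube_counts(toy_base)
aggregated = [c for c in counts if STAR in c]
iceberg = [c for c in aggregated if counts[c] >= 2]
closed = [c for c in counts if is_closed(c, toy_base)]
print(len(aggregated), len(iceberg), len(closed))  # prints: 12 2 3
```

Enumeration grows as 2^n per base cell, so it is only practical for small n; for the cube in the question the same overlap argument has to be carried out by hand.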
3. Since items have different values and expected frequencies of sale, it
is desirable to use group-based minimum support thresholds set up by
users. For example, one may set up a small minimum support for the group
of diamonds but a rather large one for the group of shoes. Outline an
Apriori-like algorithm that derives the set of frequent items efficiently
in a transaction database.
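One possible way to approach question 3 is sketched below. It is not the only answer and rests on two assumptions that the question does not state: an itemset is judged against the smallest threshold among the groups of its items, and candidates are pruned level by level with the smallest threshold over all groups, which keeps the Apriori subset check valid. The function name, the group names, and the toy transactions are hypothetical.

```python
from itertools import combinations


def mine_with_group_support(transactions, group_of, min_sup):
    """transactions: list of sets; group_of: item -> group name;
    min_sup: group name -> absolute minimum support count."""
    floor = min(min_sup.values())  # weakest threshold, safe for pruning

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    def threshold(itemset):
        # Assumption: an itemset uses the lowest threshold of its items' groups.
        return min(min_sup[group_of[i]] for i in itemset)

    items = sorted({i for t in transactions for i in t})
    level = {}
    for item in items:
        single = frozenset([item])
        c = count(single)
        if c >= floor:
            level[single] = c

    frequent, k = {}, 1
    while level:
        # Report itemsets that meet their own group-based threshold.
        for itemset, c in level.items():
            if c >= threshold(itemset):
                frequent[itemset] = c
        # Join step plus Apriori subset check against the current level.
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in level for sub in combinations(union, k)
            ):
                candidates.add(union)
        # Count candidates and keep those above the pruning floor.
        level = {}
        for cand in candidates:
            c = count(cand)
            if c >= floor:
                level[cand] = c
        k += 1
    return frequent


toy_transactions = [
    {"diamond", "shoes"}, {"shoes", "socks"}, {"shoes", "socks"},
    {"shoes"}, {"socks"},
]
toy_groups = {"diamond": "jewelry", "shoes": "apparel", "socks": "apparel"}
toy_min_sup = {"jewelry": 1, "apparel": 3}
result = mine_with_group_support(toy_transactions, toy_groups, toy_min_sup)
```

In this toy run {shoes, socks} occurs twice and is dropped because the apparel threshold is 3, while {diamond, shoes} occurs once and is kept because the jewelry threshold is 1.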
4. For mining correlated patterns in a transaction database, all confidence
(α) has been used as an interestingness measure. A set of items
$\{A_1, A_2, \ldots, A_k\}$ is strongly correlated if
$$\frac{\mathrm{sup}(A_1, A_2, \ldots, A_k)}{\max(\mathrm{sup}(A_1), \ldots, \mathrm{sup}(A_k))} \ge \mathit{min}_\alpha \qquad (1)$$
where $\mathit{min}_\alpha$ is the minimal all confidence threshold and
$\max(\mathrm{sup}(A_1), \ldots, \mathrm{sup}(A_k))$ is the maximal support
among all the single items.
Based on the equation above, prove that if the current k-itemset does not
satisfy the constraint, then no corresponding (k+1)-itemset (a superset
obtained by adding one item) can satisfy it either.
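Before writing the proof for question 4, it can help to see the property on data. The sketch below computes all confidence exactly as the ratio in the question and checks, on a small made-up transaction list, that adding an item never increases it. The function names and the transactions are assumptions for illustration, and a numerical check is not a substitute for the proof.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def all_confidence(itemset, transactions):
    """sup(itemset) divided by the largest single-item support in it."""
    max_item_sup = max(support(frozenset([i]), transactions) for i in itemset)
    if max_item_sup == 0:
        return 0.0
    return support(itemset, transactions) / max_item_sup


# Toy transactions (assumed data).
toy_db = [
    frozenset("abc"), frozenset("ab"), frozenset("ac"),
    frozenset("bcd"), frozenset("ad"),
]
items = sorted(set().union(*toy_db))

# For every itemset X and every one-item extension Y of X,
# record whether all_confidence(Y) <= all_confidence(X).
checks = []
for k in range(1, len(items)):
    for combo in combinations(items, k):
        x = frozenset(combo)
        for extra in items:
            if extra not in x:
                y = x | {extra}
                checks.append(
                    all_confidence(y, toy_db) <= all_confidence(x, toy_db)
                )

print(all(checks))  # prints: True
```

The check reflects the two facts the proof relies on: the numerator sup(Y) cannot exceed sup(X), and the denominator for Y is a maximum over a larger set of items, so it cannot be smaller than the one for X.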
5. What are the major differences among the three:
(1) information gain (2) gain ratio (3) FOIL gain?
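The three measures in question 5 can be compared side by side on a small example. The sketch below uses their usual textbook definitions; the function names and the toy class counts are assumptions made here. Information gain and gain ratio score a multiway split of a node, whereas FOIL gain scores the extension of a single rule and looks only at the tuples the rule covers.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def info_gain(parent, partitions):
    """parent: class counts at the node; partitions: class counts per branch."""
    total = sum(parent)
    remainder = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent) - remainder


def gain_ratio(parent, partitions):
    """Information gain normalized by the split information of the partition sizes."""
    split_info = entropy([sum(p) for p in partitions])
    return info_gain(parent, partitions) / split_info


def foil_gain(pos, neg, new_pos, new_neg):
    """Gain from extending a rule that covered (pos, neg) tuples
    to one that covers (new_pos, new_neg) tuples."""
    return new_pos * (log2(new_pos / (new_pos + new_neg)) - log2(pos / (pos + neg)))


# Toy node with 2 positive and 2 negative tuples (assumed data).
parent = [2, 2]
binary_split = [[2, 0], [0, 2]]                   # two pure branches
id_like_split = [[1, 0], [1, 0], [0, 1], [0, 1]]  # one tuple per branch

print(info_gain(parent, binary_split), info_gain(parent, id_like_split))    # prints: 1.0 1.0
print(gain_ratio(parent, binary_split), gain_ratio(parent, id_like_split))  # prints: 1.0 0.5
print(foil_gain(2, 2, 2, 0))  # prints: 2.0
```

Both splits reach the same information gain, but gain ratio penalizes the split with many small branches, which is the bias toward many-valued attributes that it is designed to correct.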
6. What are the major differences between:
(1) bagging (2) boosting?
7. Given a 50 GB data set with 40 attributes, each containing 100 distinct
values, and a laptop with 512 MB of main memory, outline an efficient
method for constructing decision trees, and answer the following
questions explicitly:
(a) How many scans of the database does your algorithm take if the
maximal depth of the derived decision tree is 5?
(b) How do you use your memory space in your tree induction?
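For part (b) of question 7, a back-of-the-envelope memory estimate is a useful starting point. The sketch below assumes one possible approach that the question does not name: keeping, for each tree node under construction, a table of counts per (attribute, value, class label), as in the AVC-sets of RainForest. The number of classes and the counter width are not given in the question and are assumptions here, as is the function name.

```python
def avc_group_bytes(num_attributes, distinct_values, num_classes, bytes_per_counter=4):
    """Size of the count tables for one tree node:
    one counter per (attribute, value, class) triple."""
    return num_attributes * distinct_values * num_classes * bytes_per_counter


MEMORY_BYTES = 512 * 1024 * 1024  # 512 MB of main memory

# Assumed: 2 class labels and 4-byte counters.
per_node = avc_group_bytes(40, 100, num_classes=2)
nodes_in_memory = MEMORY_BYTES // per_node
print(per_node, nodes_in_memory)  # prints: 32000 16777
```

Under these assumptions the count tables for a node are tiny compared with the 50 GB of raw data, so many nodes can be handled per scan; how many scans that yields depends on how many nodes each tree level has and should be argued in the answer.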