Machine learning is reaching notable success when solving complex tasks in many fields. This course serves as in introduction to basic machine learning concepts and techniques, focusing both on the theoretical foundation, and on implementation and utilization of machine learning algorithms in Python programming language. High attention is paid to the ability of application of the machine learning techniques on practical tasks, in which the students try to devise a solution with highest performance.
Python programming skills are required, together with basic probability theory knowledge.
Official name: Introduction to Machine Learning with Python
SIS code: NPFL129
Semester: winter
E-credits: 5
Examination: 2/2 C+Ex
Instructors: Jindřich Libovický (lecture),
Zdeněk Kasner,
Tomáš Musil (practicals),
Milan Straka (assignments & ReCodEx),
Radovan Haluška, Tymur Kotkov, Matej Straka (teaching assistants)
This course is also part of the inter-university programme prg.ai Minor. It pools the best of AI education in Prague to provide students with a deeper and broader insight into the field of artificial intelligence. More information is available at prg.ai/minor.
All lectures and practicals will be recorded and available on this website.
After this course students should…
1. Introduction to Machine Learning Slides PDF Slides linear_regression_manual linear_regression_features CS Lecture EN Lecture CS Practicals EN Practicals Questions
2. Linear Regression, SGD Slides PDF Slides CS Lecture EN Practicals linear_regression_l2 linear_regression_sgd feature_engineering rental_competition Questions
3. Peceptron, Logistic Regression Slides PDF Slides CS Lecture EN Lecture CS Practicals EN Practicals perceptron logistic_regression_sgd grid_search thyroid_competition Questions
4. Multiclass Logistic Regression, Multilayer Perceptron Slides PDF Slides
Unless otherwise stated, teaching materials for this course are available under CC BY-SA 4.0.
The lecture content, including references to some additional study materials. The main study material is the Pattern Recognition and Machine Learning by Christopher Bishop, referred to as PRML.
Note that the topics in italics are not required for the exam.
Sep 30 Slides PDF Slides linear_regression_manual linear_regression_features CS Lecture EN Lecture CS Practicals EN Practicals Questions
Learning objectives. After the lecture you should be able to…
Covered topics and where to find more:
After the lecture: short and non-comprehensive recap quiz.
Oct 7 Slides PDF Slides CS Lecture EN Practicals linear_regression_l2 linear_regression_sgd feature_engineering rental_competition Questions
Learning objectives. After the lecture you should be able to
Covered topics and where to find more:
Due to technical issues when processing the video from the English lecture, the video is unavailable. Please watch the video from the previous runs of the course.
After the lecture: short and non-comprehensive recap quiz.
Oct 14 Slides PDF Slides CS Lecture EN Lecture CS Practicals EN Practicals perceptron logistic_regression_sgd grid_search thyroid_competition Questions
Learning objectives. After the lecture you should be able to
Covered topics and where to find more:
After the lecture: short and non-comprehensive recap quiz.
Oct 21 Slides PDF Slides
Learning objectives. After the lecture you should be able to
Covered topics and where to find more:
To pass the practicals, you need to obtain at least 70 points, excluding the bonus points. Note that up to 40 points above 70 (both bonus and non-bonus) will be transfered to the exam. In total, assignments for at least 105 points (not including the bonus points) will be available.
The tasks are evaluated automatically using the ReCodEx Code Examiner.
The evaluation is performed using Python 3.11, scikit-learn 1.5.2, numpy 2.1.1, scipy 1.14.1, pandas 2.2.2, and matplotlib 3.9.2. You should install the exact version of these packages yourselves.
Solving assignments in teams (of size at most 3) is encouraged, but everyone has to participate (it is forbidden not to work on an assignment and then submit a solution created by other team members). All members of the team must submit in ReCodEx individually, but can have exactly the same sources/models/results. Each such solution must explicitly list all members of the team to allow plagiarism detection using this template.
Cheating is strictly prohibited and any student found cheating will be punished. The punishment can involve failing the whole course, or, in grave cases, being expelled from the faculty. While discussing assignments with any classmate is fine, each team must complete the assignments themselves, without using code they did not write (unless explicitly allowed). Of course, inside a team you are allowed to share code and submit identical solutions.
Deadline: Oct 14, 22:00 3 points
Starting with the linear_regression_manual.py template, solve a linear regression problem using the algorithm from the lecture which explicitly computes the matrix inversion. Then compute root mean square error on the test set.
Note that your results may be slightly different (because of varying floating point arithmetic on your CPU).
python3 linear_regression_manual.py --test_size=0.1
52.38
python3 linear_regression_manual.py --test_size=0.5
54.58
python3 linear_regression_manual.py --test_size=0.9
59.46
Deadline: Oct 14, 22:00 3 points
Starting with the
linear_regression_features.py
template, use scikit-learn
to train a model of a 1D curve.
Try using a concatenation of features for from 1 to a given range, and report RMSE of every such configuration.
Note that your results may be slightly different (because of varying floating point arithmetic on your CPU).
python3 linear_regression_features.py --data_size=10 --test_size=5 --range=6
Maximum feature order 1: 0.74 RMSE
Maximum feature order 2: 1.87 RMSE
Maximum feature order 3: 0.53 RMSE
Maximum feature order 4: 4.52 RMSE
Maximum feature order 5: 1.70 RMSE
Maximum feature order 6: 2.82 RMSE
python3 linear_regression_features.py --data_size=30 --test_size=20 --range=9
Maximum feature order 1: 0.56 RMSE
Maximum feature order 2: 1.53 RMSE
Maximum feature order 3: 1.10 RMSE
Maximum feature order 4: 0.28 RMSE
Maximum feature order 5: 1.60 RMSE
Maximum feature order 6: 3.09 RMSE
Maximum feature order 7: 3.92 RMSE
Maximum feature order 8: 65.11 RMSE
Maximum feature order 9: 3886.97 RMSE
python3 linear_regression_features.py --data_size=50 --test_size=40 --range=9
Maximum feature order 1: 0.63 RMSE
Maximum feature order 2: 0.73 RMSE
Maximum feature order 3: 0.31 RMSE
Maximum feature order 4: 0.26 RMSE
Maximum feature order 5: 1.22 RMSE
Maximum feature order 6: 0.69 RMSE
Maximum feature order 7: 2.39 RMSE
Maximum feature order 8: 7.28 RMSE
Maximum feature order 9: 201.70 RMSE
Deadline: Oct 21, 22:00 2 points
Starting with the linear_regression_l2.py
template, use scikit-learn
to train L2-regularized linear regression models
and print the results of the best of them.
Note that your results may be slightly different (because of varying floating point arithmetic on your CPU).
python3 linear_regression_l2.py --test_size=0.15
0.49 52.11
python3 linear_regression_l2.py --test_size=0.80
0.10 53.53
Deadline: Oct 21, 22:00 4 points
Starting with the linear_regression_sgd.py, implement minibatch SGD for linear regression and compare the results to an explicit linear regression solver.
Note that your results may be slightly different (because of varying floating point arithmetic on your CPU).
python3 linear_regression_sgd.py --batch_size=10 --epochs=50 --learning_rate=0.01
Test RMSE: SGD 114.118, explicit 115.6
Learned weights: 6.864 6.907 -1.208 -2.252 -1.464 -13.323 13.909 4.883 -11.468 -0.229 37.803 -5.191 ...
python3 linear_regression_sgd.py --batch_size=10 --epochs=50 --learning_rate=0.1
Test RMSE: SGD 111.395, explicit 115.6
Learned weights: 11.559 12.428 -1.529 -2.236 -1.575 -8.868 18.842 3.882 -7.175 -1.373 38.918 -6.522 ...
python3 linear_regression_sgd.py --batch_size=10 --epochs=50 --learning_rate=0.001
Test RMSE: SGD 151.210, explicit 115.6
Learned weights: 1.885 -0.580 -0.386 0.389 -1.745 -6.994 6.787 3.019 -8.013 0.353 15.712 -3.322 ...
python3 linear_regression_sgd.py --batch_size=1 --epochs=50 --learning_rate=0.01
Test RMSE: SGD 111.395, explicit 115.6
Learned weights: 11.559 12.429 -1.529 -2.236 -1.574 -8.868 18.843 3.882 -7.174 -1.373 38.917 -6.522 ...
python3 linear_regression_sgd.py --batch_size=50 --epochs=50 --learning_rate=0.01
Test RMSE: SGD 136.015, explicit 115.6
Learned weights: 2.940 0.504 -0.555 0.143 -2.088 -10.664 9.146 4.607 -11.620 0.129 24.294 -4.089 ...
python3 linear_regression_sgd.py --batch_size=50 --epochs=500 --learning_rate=0.01
Test RMSE: SGD 111.914, explicit 115.6
Learned weights: 9.360 9.428 -1.333 -2.646 -1.379 -11.248 16.352 4.153 -9.041 -0.755 38.872 -5.881 ...
python3 linear_regression_sgd.py --batch_size=50 --epochs=500 --learning_rate=0.01 --l2=0.1
Test RMSE: SGD 113.521, explicit 115.6
Learned weights: 8.013 7.818 -1.227 -2.234 -1.491 -11.592 14.863 4.343 -9.807 -0.575 36.745 -5.487 ...
Deadline: Oct 21, 22:00 3 points
Starting with the feature_engineering.py
template, learn how to perform basic feature engineering using scikit-learn
.
Note that your results may be slightly different (because of varying floating point arithmetic on your CPU).
python3 feature_engineering.py --dataset=diabetes
-0.5745 -0.9514 1.797 -0.4984 0.4751 0.9487 -0.6961 0.7574 0.06019 1.625 0.33 0.5465 -1.033 0.2863 -0.2729 -0.545 0.3999 -0.4351 -0.03458 -0.9334 0.9052 -1.71 0.4742 -0.452 -0.9026 0.6623 -0.7206 -0.05727 -1.546 3.23 -0.8959 0.8539 1.705 -1.251 1.361 0.1082 2.92 0.2484 -0.2368 -0.4729 0.347 -0.3775 -0.03 -0.8099 0.2257 0.4507 -0.3307 0.3598 0.0286 0.7719 0.9 -0.6604 0.7185 0.0571 1.541 0.4845 -0.5272 -0.0419 -1.131 0.5736 0.04559 1.231 0.003623 0.0978 2.64
0.2776 -0.9514 0.08366 -1.148 -1.592 -1.397 -0.4687 -0.7816 -0.3766 -1.973 0.07706 -0.2641 0.02322 -0.3186 -0.442 -0.3878 -0.1301 -0.217 -0.1045 -0.5477 0.9052 -0.0796 1.092 1.515 1.329 0.4459 0.7436 0.3583 1.877 0.007 -0.09602 -0.1332 -0.1169 -0.03921 -0.06539 -0.03151 -0.1651 1.317 1.827 1.603 0.5379 0.8971 0.4322 2.264 2.535 2.224 0.7462 1.245 0.5996 3.141 1.952 0.6548 1.092 0.5261 2.757 0.2197 0.3663 0.1765 0.9247 0.6109 0.2944 1.542 0.1418 0.7431 3.893
0.8198 1.051 -0.683 -0.8108 -0.6896 -0.4871 -0.2413 -0.03186 -0.2682 0.04527 0.6721 0.8617 -0.5599 -0.6647 -0.5653 -0.3993 -0.1978 -0.02612 -0.2199 0.03711 1.105 -0.7179 -0.8522 -0.7248 -0.512 -0.2536 -0.03348 -0.2819 0.04758 0.4665 0.5538 0.471 0.3327 0.1648 0.02176 0.1832 -0.03092 0.6574 0.5591 0.3949 0.1956 0.02583 0.2175 -0.0367 0.4755 0.3359 0.1664 0.02197 0.185 -0.03121 0.2373 0.1175 0.01552 0.1306 -0.02205 0.05822 0.007686 0.06472 -0.01092 0.001015 0.008544 -0.001442 0.07194 -0.01214 0.002049
0.9747 1.051 1.211 0.6803 0.6207 -0.9859 -1.151 1.547 2.783 2.853 0.9501 1.025 1.18 0.6631 0.605 -0.961 -1.122 1.508 2.712 2.781 1.105 1.273 0.715 0.6524 -1.036 -1.21 1.626 2.925 2.999 1.467 0.8239 0.7517 -1.194 -1.394 1.873 3.37 3.456 0.4628 0.4222 -0.6707 -0.7829 1.052 1.893 1.941 0.3852 -0.6119 -0.7143 0.96 1.727 1.771 0.972 1.135 -1.525 -2.743 -2.813 1.325 -1.78 -3.203 -3.284 2.392 4.304 4.413 7.743 7.94 8.142
-0.1872 -0.9514 0.1739 -1.171 -0.5149 -0.8915 0.5925 -0.8211 0.3554 -0.1302 0.03503 0.1781 -0.03254 0.2193 0.09637 0.1669 -0.1109 0.1537 -0.06651 0.02438 0.9052 -0.1654 1.115 0.4899 0.8482 -0.5637 0.7812 -0.3381 0.1239 0.03023 -0.2037 -0.08952 -0.155 0.103 -0.1428 0.06178 -0.02264 1.372 0.6032 1.044 -0.6941 0.9619 -0.4163 0.1526 0.2651 0.459 -0.3051 0.4228 -0.183 0.06706 0.7948 -0.5282 0.732 -0.3168 0.1161 0.3511 -0.4865 0.2106 -0.07717 0.6742 -0.2918 0.1069 0.1263 -0.04628 0.01696
0.9747 -0.9514 -0.1869 -0.3058 2.659 2.728 0.3651 0.7574 0.676 -0.1302 0.9501 -0.9274 -0.1822 -0.2981 2.592 2.659 0.3559 0.7382 0.6589 -0.1269 0.9052 0.1778 0.291 -2.53 -2.596 -0.3474 -0.7206 -0.6431 0.1239 0.03494 0.05717 -0.497 -0.51 -0.06824 -0.1416 -0.1264 0.02434 0.09354 -0.8132 -0.8344 -0.1117 -0.2316 -0.2067 0.03983 7.07 7.254 0.9708 2.014 1.797 -0.3463 7.443 0.9961 2.066 1.844 -0.3553 0.1333 0.2765 0.2468 -0.04755 0.5736 0.512 -0.09864 0.457 -0.08804 0.01696
1.982 -0.9514 0.715 0.4877 -0.5149 -0.3253 -0.01389 -0.8211 -0.4683 -0.4813 3.927 -1.885 1.417 0.9664 -1.02 -0.6447 -0.02753 -1.627 -0.928 -0.9537 0.9052 -0.6803 -0.464 0.4899 0.3095 0.01322 0.7812 0.4455 0.4579 0.5113 0.3487 -0.3682 -0.2326 -0.009932 -0.5871 -0.3348 -0.3441 0.2378 -0.2511 -0.1586 -0.006774 -0.4004 -0.2284 -0.2347 0.2651 0.1675 0.007152 0.4228 0.2411 0.2478 0.1058 0.004519 0.2671 0.1523 0.1566 0.000193 0.01141 0.006505 0.006685 0.6742 0.3845 0.3952 0.2193 0.2254 0.2316
1.362 1.051 -0.1418 -0.2337 2.193 1.084 1.123 -0.03186 1.76 -0.3935 1.855 1.432 -0.1932 -0.3183 2.987 1.476 1.53 -0.04339 2.397 -0.536 1.105 -0.1491 -0.2456 2.305 1.139 1.18 -0.03348 1.85 -0.4136 0.02011 0.03314 -0.311 -0.1537 -0.1593 0.004518 -0.2496 0.05581 0.05462 -0.5125 -0.2532 -0.2625 0.007445 -0.4113 0.09196 4.809 2.376 2.463 -0.06986 3.86 -0.8629 1.174 1.217 -0.03452 1.907 -0.4264 1.261 -0.03578 1.977 -0.4419 0.001015 -0.05607 0.01253 3.098 -0.6926 0.1548
2.059 -0.9514 1.031 1.69 1.174 0.8206 -1.606 3.046 2.055 1.274 4.24 -1.959 2.122 3.48 2.417 1.69 -3.306 6.273 4.231 2.623 0.9052 -0.9806 -1.608 -1.117 -0.7807 1.528 -2.898 -1.955 -1.212 1.062 1.742 1.21 0.8458 -1.655 3.14 2.118 1.313 2.857 1.984 1.387 -2.714 5.149 3.473 2.153 1.378 0.9633 -1.885 3.576 2.412 1.495 0.6734 -1.318 2.5 1.686 1.045 2.578 -4.891 -3.3 -2.045 9.279 6.26 3.88 4.223 2.618 1.623
0.2776 1.051 -0.48 -0.0173 0.8245 1.178 -0.1655 0.7574 -0.1065 -0.218 0.07706 0.2918 -0.1333 -0.004801 0.2289 0.327 -0.04594 0.2102 -0.02956 -0.06051 1.105 -0.5046 -0.01818 0.8666 1.238 -0.1739 0.7961 -0.1119 -0.2291 0.2304 0.008303 -0.3958 -0.5654 0.07944 -0.3636 0.05111 0.1046 0.0002992 -0.01426 -0.02037 0.002862 -0.0131 0.001842 0.00377 0.6798 0.9711 -0.1364 0.6245 -0.08779 -0.1797 1.387 -0.1949 0.8921 -0.1254 -0.2568 0.02739 -0.1253 0.01762 0.03608 0.5736 -0.08064 -0.1651 0.01134 0.02321 0.04752
python3 feature_engineering.py --dataset=linnerud
1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 ...
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
python3 feature_engineering.py --dataset=wine
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -0.2976 1.31 0.1177 0.9155 0.7783 0.5271 0.4232 0.6668 -1.122 1.048 0.7913 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1.282 -1.41 0.8239 -0.4697 -0.1538 0.2001 -1.152 1.377 -0.9146 -0.6609 0.7232 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.442 1.419 0.5492 -1.8 1.03 0.9984 -1.302 0.8977 -0.02794 -0.2336 1.336 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -0.4121 -0.9989 -1.844 -0.3312 -1.263 -0.6176 -0.6272 -0.3989 -1.174 0.4073 0.301 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -0.9155 -0.7526 -0.2354 0.8601 -0.7751 -0.3001 0.4232 -0.02594 -1.174 1.646 -0.3936 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.778 0.6071 0.7454 -1.245 0.586 0.9888 -1.527 0.1517 -0.02794 0.06549 1.105 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.43 1.465 0.2746 -0.2204 0.8079 0.6233 -0.5521 -0.5766 0.03261 -0.3191 1.064 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -0.03446 0.3424 1.295 0.3614 -1.13 -1.445 1.173 -1.465 -0.2442 -0.7464 -0.3255 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0.8809 -0.8529 1.295 0.777 1.03 1.2 -0.6272 1.431 0.2316 1.048 0.2193 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -0.6752 -1.154 -1.765 -0.02646 -0.2869 -0.001945 -0.7772 -0.9496 -0.2096 0.7491 1.268 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
Deadline: Oct 21, 22:00 3 points+4 bonus
This assignment is a competition task. Your goal is to perform regression on the data from a bike rental shop. The train set contains 1000 instances, each instance consists of 12 features, both integral and real.
The rental_competition.py template shows how to load the training data, downloading it if needed. Furthermore, it shows how to save a trained estimator and how to load it during prediction.
The performance of your system is measured using root mean squared error and your goal is to achieve RMSE less than 100. Note that you can use any number of generalized linear models from sklearn to solve this assignment (but no other ML models like decision trees, MLPs, …; however, you can use any supporting methods like pre/post-processing, data manipulation, evaluation, cross-validation, …).
Deadline: Oct 28, 22:00 2 points
Starting with the perceptron.py template, implement the perceptron algorithm.
Note that your results may be slightly different (because of varying floating point arithmetic on your CPU).
python3 perceptron.py --data_size=100 --seed=17
Learned weights 4.10 2.94 -1.00
python3 perceptron.py --data_size=50 --seed=320
Learned weights -2.30 -1.96 -2.00
python3 perceptron.py --data_size=200 --seed=92
Learned weights 4.43 1.54 -2.00
Deadline: Oct 28, 22:00 5 points
Starting with the logistic_regression_sgd.py, implement minibatch SGD for logistic regression.
Note that your results may be slightly different (because of varying floating point arithmetic on your CPU).
python3 logistic_regression_sgd.py --data_size=100 --batch_size=10 --epochs=9 --learning_rate=0.5
After epoch 1: train loss 0.3259 acc 94.0%, test loss 0.3301 acc 96.0%
After epoch 2: train loss 0.2321 acc 96.0%, test loss 0.2385 acc 98.0%
After epoch 3: train loss 0.1877 acc 98.0%, test loss 0.1949 acc 98.0%
After epoch 4: train loss 0.1612 acc 98.0%, test loss 0.1689 acc 98.0%
After epoch 5: train loss 0.1435 acc 98.0%, test loss 0.1517 acc 98.0%
After epoch 6: train loss 0.1307 acc 98.0%, test loss 0.1396 acc 98.0%
After epoch 7: train loss 0.1208 acc 98.0%, test loss 0.1304 acc 96.0%
After epoch 8: train loss 0.1129 acc 98.0%, test loss 0.1230 acc 96.0%
After epoch 9: train loss 0.1065 acc 98.0%, test loss 0.1170 acc 96.0%
Learned weights 2.77 -0.60 0.12
python3 logistic_regression_sgd.py --data_size=95 --test_size=45 --batch_size=5 --epochs=9 --learning_rate=0.5
After epoch 1: train loss 0.2429 acc 96.0%, test loss 0.3187 acc 93.3%
After epoch 2: train loss 0.1853 acc 96.0%, test loss 0.2724 acc 93.3%
After epoch 3: train loss 0.1590 acc 96.0%, test loss 0.2525 acc 93.3%
After epoch 4: train loss 0.1428 acc 96.0%, test loss 0.2411 acc 93.3%
After epoch 5: train loss 0.1313 acc 98.0%, test loss 0.2335 acc 93.3%
After epoch 6: train loss 0.1225 acc 96.0%, test loss 0.2258 acc 93.3%
After epoch 7: train loss 0.1159 acc 96.0%, test loss 0.2220 acc 93.3%
After epoch 8: train loss 0.1105 acc 96.0%, test loss 0.2187 acc 93.3%
After epoch 9: train loss 0.1061 acc 96.0%, test loss 0.2163 acc 93.3%
Learned weights -0.61 3.61 0.12
python3 logistic_regression_sgd.py --data_size=95 --test_size=45 --batch_size=1 --epochs=9 --learning_rate=0.7
After epoch 1: train loss 0.1141 acc 96.0%, test loss 0.2268 acc 93.3%
After epoch 2: train loss 0.0867 acc 96.0%, test loss 0.2150 acc 91.1%
After epoch 3: train loss 0.0797 acc 98.0%, test loss 0.2320 acc 88.9%
After epoch 4: train loss 0.0753 acc 96.0%, test loss 0.2224 acc 88.9%
After epoch 5: train loss 0.0692 acc 96.0%, test loss 0.2154 acc 88.9%
After epoch 6: train loss 0.0749 acc 98.0%, test loss 0.2458 acc 88.9%
After epoch 7: train loss 0.0638 acc 96.0%, test loss 0.2190 acc 88.9%
After epoch 8: train loss 0.0644 acc 98.0%, test loss 0.2341 acc 88.9%
After epoch 9: train loss 0.0663 acc 98.0%, test loss 0.2490 acc 88.9%
Learned weights -1.07 7.33 -0.40
Deadline: Oct 28, 22:00 2 points
Starting with grid_search.py
template, perform a hyperparameter grid search, evaluating hyperparameter performance
using a stratified k-fold crossvalidation, and finally evaluate a model
trained with best hyperparameters on all training data. The easiest way is
to utilize sklearn.model_selection.GridSearchCV
.
Note that your results may be slightly different (because of varying floating point arithmetic on your CPU).
python3 grid_search.py --test_size=0.5
Test accuracy: 98.11%
python3 grid_search.py --test_size=0.7
Test accuracy: 96.26%
Note that your results may be slightly different (because of varying floating point arithmetic on your CPU).
python3 grid_search.py --test_size=0.5
Rank: 11 Cross-val: 86.7% lr__C: 0.01 lr__solver: lbfgs polynomial__degree: 1
Rank: 5 Cross-val: 92.7% lr__C: 0.01 lr__solver: lbfgs polynomial__degree: 2
Rank: 11 Cross-val: 86.7% lr__C: 0.01 lr__solver: sag polynomial__degree: 1
Rank: 5 Cross-val: 92.7% lr__C: 0.01 lr__solver: sag polynomial__degree: 2
Rank: 7 Cross-val: 91.0% lr__C: 1 lr__solver: lbfgs polynomial__degree: 1
Rank: 2 Cross-val: 96.8% lr__C: 1 lr__solver: lbfgs polynomial__degree: 2
Rank: 8 Cross-val: 90.8% lr__C: 1 lr__solver: sag polynomial__degree: 1
Rank: 3 Cross-val: 96.8% lr__C: 1 lr__solver: sag polynomial__degree: 2
Rank: 10 Cross-val: 90.1% lr__C: 100 lr__solver: lbfgs polynomial__degree: 1
Rank: 4 Cross-val: 96.4% lr__C: 100 lr__solver: lbfgs polynomial__degree: 2
Rank: 9 Cross-val: 90.5% lr__C: 100 lr__solver: sag polynomial__degree: 1
Rank: 1 Cross-val: 97.0% lr__C: 100 lr__solver: sag polynomial__degree: 2
Test accuracy: 98.11%
python3 grid_search.py --test_size=0.7
Rank: 11 Cross-val: 87.9% lr__C: 0.01 lr__solver: lbfgs polynomial__degree: 1
Rank: 5 Cross-val: 91.8% lr__C: 0.01 lr__solver: lbfgs polynomial__degree: 2
Rank: 11 Cross-val: 87.9% lr__C: 0.01 lr__solver: sag polynomial__degree: 1
Rank: 5 Cross-val: 91.8% lr__C: 0.01 lr__solver: sag polynomial__degree: 2
Rank: 7 Cross-val: 91.5% lr__C: 1 lr__solver: lbfgs polynomial__degree: 1
Rank: 2 Cross-val: 95.9% lr__C: 1 lr__solver: lbfgs polynomial__degree: 2
Rank: 8 Cross-val: 91.3% lr__C: 1 lr__solver: sag polynomial__degree: 1
Rank: 3 Cross-val: 95.7% lr__C: 1 lr__solver: sag polynomial__degree: 2
Rank: 9 Cross-val: 89.4% lr__C: 100 lr__solver: lbfgs polynomial__degree: 1
Rank: 4 Cross-val: 95.2% lr__C: 100 lr__solver: lbfgs polynomial__degree: 2
Rank: 10 Cross-val: 89.2% lr__C: 100 lr__solver: sag polynomial__degree: 1
Rank: 1 Cross-val: 96.1% lr__C: 100 lr__solver: sag polynomial__degree: 2
Test accuracy: 96.26%
Deadline: Oct 28, 22:00 3 points+4 bonus
This assignment is a competition task. Your goal is to perform binary classification – given medical data with 15 binary and 6 real-valued attributes, predict whether thyroid is functioning normally or not. The train set and test set consist of ~3.5k instances.
The thyroid_competition.py template shows how to load training data, downloading it if needed. Furthermore, it shows how to save a trained estimator and how to load it during prediction.
The performance of your system is measured using accuracy of correctly predicted examples and your goal is to achieve at least 96% accuracy. Note that you can use any number of generalized linear models from sklearn to solve this assignment (but no other ML models like decision trees, MLPs, …; however, you can use any supporting methods like pre/post-processing, data manipulation, evaluation, cross-validation, …).
In the competitions, your goal is to train a model and then predict target values on the test set available only in ReCodEx.
When submitting a competition solution to ReCodEx, you should submit a trained model and a Python source capable of running it.
Furthermore, please also include the Python source and hyperparameters
you used to train the submitted model. But be careful that there still must be
exactly one Python source with a line starting with def main(
.
Do not forget about the maximum allowed model size and time and memory limits.
Before the deadline, ReCodEx prints the exact achieved score, but only if it is worse than the baseline.
If you surpass the baseline, the assignment is marked as solved in ReCodEx and you immediately get regular points for the assignment. However, ReCodEx does not print the reached score.
After the competition deadline, the latest submission of every user surpassing the required baseline participates in a competition. Additional bonus points are then awarded according to the ordering of the performance of the participating submissions.
After the competition results announcement, ReCodEx starts to show the exact performance for all the already submitted solutions and also for the solutions submitted later.
Each competition will be scored after the first deadline.
The bonus points will be computed in the following fashion:
Let be the maximal number of bonus points that can be achieved in the competition.
All of the solutions that surpass the baseline will be sorted and divided into groups of equal size.
Every solution in the top group gets B points, the next group gets points, etc., the last group gets 0 bonus points.
The team solution only occupies one position in the table of the competition results.
Please, do not forget that every member of the team needs to upload the solution to ReCodEx and to submit both the training/prediction source code and the trained model itself.
numpy
or scipy
, anything you
implement yourself, and, unless specified otherwise in assignment
description, any method from sklearn
. Furthermore, the solution must be
created by you, and you must understand it fully. Do not use deep
network frameworks like TensorFlow or PyTorch.Installing to central user packages repository
You can install all required packages to central user packages repository using
pip3 install --user scikit-learn==1.5.2 numpy==2.1.1 scipy==1.14.1 pandas==2.2.2 matplotlib==3.9.2
.
Installing to a virtual environment
Python supports virtual environments, which are directories containing
independent sets of installed packages. You can create a virtual environment
by running python3 -m venv VENV_DIR
followed by
VENV_DIR/bin/pip3 install scikit-learn==1.5.2 numpy==2.1.1 scipy==1.14.1 pandas==2.2.2 matplotlib==3.9.2
(or VENV_DIR/Scripts/pip3
on Windows).
Windows installation
python3
is not in PATH, while py
command
is – in that case you can use py -m venv VENV_DIR
, which uses the newest
Python available, or for example py -3.11 -m venv VENV_DIR
, which uses
Python version 3.11.Is it possible to keep the solutions in a Git repository?
Definitely. Keeping the solutions in a branch of your repository, where you merge them with the course repository, is probably a good idea. However, please keep the cloned repository with your solutions private.
On GitHub, do not create a public fork with your solutions
If you keep your solutions in a GitHub repository, please do not create a clone of the repository by using the Fork button – this way, the cloned repository would be public.
Of course, if you just want to create a pull request, GitHub requires a public fork and that is fine – just do not store your solutions in it.
How to clone the course repository?
To clone the course repository, run
git clone https://github.com/ufal/npfl129
This creates the repository in the npfl129
subdirectory; if you want a different
name, add it as a last parameter.
To update the repository, run git pull
inside the repository directory.
How to keep the course repository as a branch in your repository?
If you want to store the course repository just in a local branch of your existing repository, you can run the following command while in it:
git remote add upstream https://github.com/ufal/npfl129
git fetch upstream
git checkout -t upstream/master
This creates a branch master
; if you want a different name, add
-b BRANCH_NAME
to the last command.
In both cases, you can update your checkout by running git pull
while in it.
How to merge the course repository with your modifications?
If you want to store your solutions in a branch merged with the course repository, you should start by
git remote add upstream https://github.com/ufal/npfl129
git pull upstream master
which creates a branch master
; if you want a different name,
change the last argument to master:BRANCH_NAME
.
You can then commit to this branch and push it to your repository.
To merge the current course repository with your branch, run
git merge upstream master
while in your branch. Of course, it might be necessary to resolve conflicts if both you and I modified the same place in the templates.
What files can be submitted to ReCodEx?
You can submit multiple files of any type to ReCodEx. There is a limit of 20 files per submission, with a total size of 20MB.
What file does ReCodEx execute and what arguments does it use?
Exactly one file with py
suffix must contain a line starting with def main(
.
Such a file is imported by ReCodEx and the main
method is executed
(during the import, __name__ == "__recodex__"
).
The file must also export an argument parser called parser
. ReCodEx uses its
arguments and default values, but it overwrites some of the arguments
depending on the test being executed – the template should always indicate which
arguments are set by ReCodEx and which are left intact.
What are the time and memory limits?
The memory limit during evaluation is 1.5GB. The time limit varies, but it should be at least 10 seconds and at least twice the running time of my solution. For competition assignments, the time limit is 5 minutes.
To pass the practicals, you need to obtain at least 70 points, excluding the bonus points. Note that up to 40 points above 70 (both bonus and non-bonus) will be transfered to the exam. In total, assignments for at least 105 points (not including the bonus points) will be available.
To pass the exam, you need to obtain at least 60, 75, or 90 points out of 100-point exam to receive a grade 3, 2, or 1, respectively. The exam consists of 100-point-worth questions from the list below (the questions are randomly generated, but in such a way that there is at least one question from every pair of lectures). In addition, you can get at most 40 surplus points from the practicals and at most 10 points for community work (i.e., fixing slides or reporting issues) – but only the points you already have at the time of the exam count. You can take the exam without passing the practicals first.
Lecture 1 Questions
Explain how reinforcement learning differs from supervised and unsupervised learning in terms of experience that the learning algorithms use model performance. [5]
Explain why we need separate training and test data. What is generalization, and how does the concept relate to underfitting and overfitting? [10]
Define the prediction function of a linear regression model and write down -regularized mean squared error loss. [10]
Starting from the unregularized sum of squares error of a linear regression model, show how the explicit solution can be obtained, assuming is invertible. [10]
Lecture 2 Questions
Describe standard gradient descent and compare it to stochastic (i.e., online) gradient descent and minibatch stochastic gradient descent. Explain what it is used for in machine learning. [10]
Explain possible intuitions behind regularization. [5]
Explain the difference between hyperparameters and parameters. [5]
Write an -regularized minibatch SGD algorithm for training a linear regression model, including the explicit formulas (i.e, formulas you would need to code it with numpy
) of the loss function and its gradient. [10]
Does the SGD algorithm for linear regression always find the best solution on the training data? If yes, explain under what conditions it happens; if not, explain why it is not guaranteed to converge. What properties of the error function does this depend on? [10]
After training a model with SGD, you ended up with a low training error and a high test error. Using the learning curves, explain what might have happened and what steps you might take to prevent this from happening. [10]
You were given a fixed training set and a fixed test set, and you are supposed to report model performance on that test set. You need to decide what hyperparameters to use. How will you proceed and why? [10]
What methods can be used to normalize feature values? Explain why it is useful. [5]
Lecture 3 Questions
Define binary classification, write down the perceptron algorithm, and show how a prediction is made for a given data instance . [10]
For discrete random variables, define entropy, cross-entropy, and Kullback-Leibler divergence, and prove the Gibbs inequality (i.e., that KL divergence is non-negative). [20]
Explain the notion of likelihood in maximum likelihood estimation. What likelihood are we estimating in machine learning, and why do we do it? [5]
Describe maximum likelihood estimation as minimizing NLL, cross-entropy, and KL divergence and explain whether they differ or are the same and why. [20]
Provide an intuitive justification for why cross-entropy is a good optimization objective in machine learning. What distributions do we compare in cross-entropy? Why is it good when the cross-entropy is low? [5]
Considering the binary logistic regression model, write down its parameters (including their size) and explain how prediction is performed (including the explicit formula for the sigmoid function). [10]
Write down an -regularized minibatch SGD algorithm for training a binary logistic regression model, including the explicit formulas (i.e., formulas you would need to code it in numpy
) of the loss function and its gradient (saying just \grad is not enough). [20]