
Date: 2024-04-02



COMP 330 Assignment #5
1 Description
In this assignment, you will be implementing regularized logistic regression to classify text documents. The implementation will be in Python, on top of Spark. To handle the large data set that we will be
giving you, it is necessary to use Amazon AWS.
You will be asked to perform three subtasks: (1) data preparation, (2) learning (which will be done via
gradient descent) and (3) evaluation of the learned model.
Note: It is important to complete HW 5 and Lab 5 before you really get going on this assignment. HW
5 will give you an opportunity to try out gradient descent for learning a model, and Lab 5 will give you
some experience with writing efficient NumPy code, both of which will be important for making your A5
experience less challenging!
2 Data
You will be dealing with a data set that consists of around 170,000 text documents and a test/evaluation
data set that consists of 18,700 text documents. All but around 6,000 of these text documents are Wikipedia
pages; the remaining documents are descriptions of Australian court cases and rulings. At the highest level,
your task is to build a classifier that can automatically figure out whether a text document is an Australian
court case.
We have prepared three data sets for your use.
1. The Training Data Set (1.9 GB of text). This is the set you will use to train your logistic regression
model:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/TrainingDataOneLinePerDoc.txt
or as direct S3 address, so you can use it in a Spark job:
s3://chrisjermainebucket/comp330 A5/TrainingDataOneLinePerDoc.txt
2. The Testing Data Set (200 MB of text). This is the set you will use to evaluate your model:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/TestingDataOneLinePerDoc.txt
or as direct S3 address, so you can use it in a Spark job:
s3://chrisjermainebucket/comp330 A5/TestingDataOneLinePerDoc.txt
3. The Small Data Set (37.5 MB of text). This is for you to use for training and testing of your model on
a smaller data set:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/SmallTrainingDataOneLinePerDoc.txt
Some Data Details to Be Aware Of. You should download and look at the SmallTrainingData.txt
file before you begin. You’ll see that the contents are sort of a pseudo-XML, where each text document
begins with a <doc id = ... > tag, and ends with </doc>. Each document is contained on a single
line of text.
Note that all of the Australian legal cases begin with something like <doc id = "AU1222" ...>;
that is, the doc id for an Australian legal case always starts with AU. You will be trying to figure out if the
document is an Australian legal case by looking only at the contents of the document.
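The format described above can be parsed with a few lines of Python. This is a minimal sketch: the regular expression and the helper name `parse_doc` are assumptions for illustration, and the real files may use different attribute spacing or quoting, so check the pattern against the small data set first.

```python
import re

# Assumed shape of one line: <doc id="AU1222" ...> body text </doc>
# The pattern tolerates spaces around '=' and straight quotes around the id.
DOC_RE = re.compile(
    r'<doc\s+id\s*=\s*["\']?([^"\'\s>]+)["\']?[^>]*>(.*?)</doc>', re.DOTALL
)

def parse_doc(line):
    """Return (doc_id, label, text), or None if the line does not match."""
    match = DOC_RE.search(line)
    if match is None:
        return None
    doc_id, text = match.group(1), match.group(2)
    # The id is used only to derive the training label (1 = Australian court case).
    label = 1 if doc_id.startswith("AU") else 0
    return doc_id, label, text.strip()
```

Only the returned text should feed the classifier; the id is kept so that misclassified documents can be looked up later.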
3 The Tasks
There are three separate tasks that you need to complete to finish the assignment. As usual, it makes
sense to implement these and run them on the small data set before moving to the larger one.
3.1 Task 1
First, you need to write Spark code that builds a dictionary that includes the 20,000 most frequent words
in the training corpus. This dictionary is essentially an RDD that has the word as the key, and the relative
frequency position of the word as the value. For example, the value is zero for the most frequent word, and
19,999 for the least frequent word in the dictionary.
To get credit for this task, give us the frequency position of the words “applicant”, “and”, “attack”,
“protein”, and “car”. These should be values from 0 to 19,999, or -1 if the word is not in the dictionary,
because it is not in the top 20,000.
Note that accomplishing this will require you to use a variant of your A4 solution. If you do not trust
your A4 solution and would like mine, you can post a private request on Piazza.
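To make the expected output concrete, here is a single-machine sketch of the dictionary in plain Python. It is not the Spark solution: on the cluster the same logic would be expressed over RDDs (for example with `flatMap`, `reduceByKey`, and `top`). The tokenizer, the tie-breaking rule, and the function names are assumptions for illustration.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z]+")  # assumed tokenizer: lowercase alphabetic runs

def build_dictionary(texts, size=20000):
    """Map each of the `size` most frequent words to its rank (0 = most frequent)."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    # Sort by descending count; ties are broken alphabetically so ranks are reproducible.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:size]
    return {word: rank for rank, (word, _) in enumerate(ranked)}

def frequency_position(dictionary, word):
    """Rank of `word`, or -1 if it is not in the dictionary."""
    return dictionary.get(word, -1)
```

Calling `frequency_position(dictionary, "applicant")` then gives the kind of value Task 1 asks for.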
3.2 Task 2
Next, you will convert each of the documents in the training set to a TF-IDF vector. You will then use
a gradient descent algorithm to learn a logistic regression model that can decide whether a document is
describing an Australian court case or not. Your model should use L2 regularization; you can play with
things a bit to determine the parameter controlling the extent of the regularization. We will have enough
data that you might find that the regularization may not be too important (that is, it may be that you get good
results with a very small weight given to the regularization constant).
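The TF-IDF conversion can be sketched with NumPy on a small in-memory count matrix. The definitions used here are assumptions (TF as the word's share of the document's dictionary words, IDF as the log of the number of documents over the number containing the word); use the definitions from class if they differ.

```python
import numpy as np

def tf_idf(count_matrix):
    """count_matrix[i, j] = occurrences of dictionary word j in document i."""
    counts = np.asarray(count_matrix, dtype=float)
    n_docs = counts.shape[0]
    # Term frequency: share of the document's dictionary words taken by each word.
    totals = counts.sum(axis=1, keepdims=True)
    tf = counts / np.maximum(totals, 1.0)  # empty documents stay all zero
    # Inverse document frequency: log(N / number of documents containing the word).
    doc_freq = np.count_nonzero(counts, axis=0)
    idf = np.log(n_docs / np.maximum(doc_freq, 1))
    return tf * idf
```

A word that appears in every document gets an IDF of zero, so it contributes nothing to any vector.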
I am going to ask that you not just look up the gradient descent algorithm on the Internet and implement
it. Start with the LLH function from class, and then derive your own gradient descent algorithm. We can
help with this if you get stuck.
At the end of each iteration, compute the LLH of your model. You should run your gradient descent
until the change in LLH across iterations is very small.
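The stopping rule can be sketched independently of the model. In the loop below, `objective` and `gradient` are hypothetical callables that you would supply yourself (the LLH from class and the gradient you derive from it); the step size, tolerance, and iteration cap are illustrative values, not recommendations.

```python
import numpy as np

def gradient_ascent(objective, gradient, theta0, step=0.1, tol=1e-8, max_iter=10000):
    """Maximize `objective`; stop once it changes by less than `tol` between iterations."""
    theta = np.asarray(theta0, dtype=float)
    previous = objective(theta)
    history = [previous]
    for _ in range(max_iter):
        theta = theta + step * gradient(theta)  # ascent, because the LLH is maximized
        current = objective(theta)
        history.append(current)
        if abs(current - previous) < tol:
            break
        previous = current
    return theta, history
```

Keeping the per-iteration values in `history` makes it easy to confirm that the LLH is actually increasing; if it is not, the step size is a likely culprit.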
Once you have completed this task, you will get credit by (a) writing up your gradient update formula,
and (b) giving us the fifty words with the largest regression coefficients, that is, the fifty words that are
most strongly associated with an Australian court case.
3.3 Task 3
Now that you have trained your model, it is time to evaluate it. Here, you will use your model to predict
whether or not each of the testing points corresponds to an Australian court case. To get credit for this task,
you need to compute for us the F1 score obtained by your classifier; we will use the F1 score obtained as
one of the ways in which we grade your Task 3 submission.
Also, I am going to ask you to actually look at the text for three of the false positives that your model
produced (that is, Wikipedia articles that your model thought were Australian court cases). Write a paragraph
describing why you think it is that your model was fooled. Were the bad documents about Australia? The
legal system?
If you don’t have three false positives, just use the ones that you had (if any).
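The evaluation step can be sketched in plain Python as follows, assuming labels and predictions are encoded as 1 for an Australian court case and 0 otherwise. The function names are illustrative.

```python
def f1_score(labels, predictions):
    """F1 = 2 * precision * recall / (precision + recall), with 1 as the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    if tp == 0:
        return 0.0  # with no true positives, precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def false_positive_ids(doc_ids, labels, predictions):
    """Ids of documents predicted as court cases that are actually Wikipedia pages."""
    return [d for d, y, p in zip(doc_ids, labels, predictions) if y == 0 and p == 1]
```

The ids returned by `false_positive_ids` are the documents to read when writing up why the model was fooled.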
4 Important Considerations
Some notes regarding training and implementation. As you implement and evaluate your gradient descent algorithm, here are a few things to keep in mind.
1. To get good accuracy, you will need to center and normalize your data. That is, transform your data so
that the mean of each dimension is zero, and the standard deviation is one. To do so, subtract the mean
vector from each data point, and then divide the result by the vector of standard deviations computed
over the data set.
2. When classifying new data, a data point whose dot product with the vector of regression coefficients is positive
is a “yes”, and one whose dot product is negative is a “no” (see slide 15 in the GLM lecture). You will be trying to maximize the
F1 of your classifier, and you can often increase the F1 by choosing a cutoff between “yes”
and “no” other than zero. Another thing that you can do is to add another dimension whose value is
one in each data point (we discussed this in class). The learning process will then choose a regression
coefficient for this special dimension that tends to balance the “yes” and “no” nicely at a cutoff of zero.
However, some students in the past have reported that this can increase the training time.
3. Students sometimes face overflow problems, both when computing the LLH and when computing the
gradient update. Some things that you can do to avoid this are: (1) use np.exp(), which seems to
be quite robust, and (2) transform your data so that the standard deviation is smaller than one. If you
have problems with a standard deviation of one, you might try 10^-2 or even 10^-5. You may need to
experiment a bit. Such are the wonderful aspects of implementing data science algorithms in the real
world!
4. If you find that your training takes more than a few hours to run to convergence on the largest data set,
it likely means that you are doing something that is inherently slow that you can speed up by looking
at your code carefully. One thing: there is no problem with first training your model on a small sample
of the large data set (say, 10% of the documents), then using the result as an initialization, and continuing
training on the full data set. This can speed up the process of reaching convergence.
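Items 1 and 3 in the list above can be sketched with NumPy. This assumes the standard Bernoulli log-likelihood for logistic regression and an L2 penalty of the form `reg` times the squared norm of the coefficients; check both against the LLH from class. The `target_std` parameter is a hypothetical knob for the smaller standard deviations suggested in item 3.

```python
import numpy as np

def standardize(X, target_std=1.0):
    """Center each column to mean zero and scale it to `target_std`."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns would otherwise divide by zero
    return (X - mean) / std * target_std, mean, std

def regularized_llh(theta, X, y, reg=0.0):
    """Logistic log-likelihood minus an L2 penalty, computed without overflow.

    log(1 + exp(z)) is evaluated as logaddexp(0, z), which stays finite for large |z|.
    """
    z = X @ theta
    return float(np.sum(y * z - np.logaddexp(0.0, z)) - reg * np.dot(theta, theta))
```

Returning `mean` and `std` lets the same transformation be applied later to the testing data.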
Big data, small data, and grading. The first two tasks are worth three points each, and the last is worth four points. Since it
can be challenging to run everything on a large data set, we’ll offer you a small data option. If you train your
model on TestingDataOneLinePerDoc.txt, and then test it on SmallTrainingDataOneLinePerDoc.txt, we’ll take off 0.5 points on Task 2 and 0.5 points on Task 3. This means you can still get an A, and
you don’t have to deal with the big data set. For the possibility of getting full credit, you can train
your model on the quite large TrainingDataOneLinePerDoc.txt data set, and then test it
on TestingDataOneLinePerDoc.txt.
4.1 Machines to Use
If you decide to try for full credit on the big data set, you will need to run your Spark jobs on three to five
machines as workers, each having around 8 cores. If you are not trying for the full credit, you can likely
get away with running on a smaller cluster. Remember, the costs WILL ADD UP QUICKLY IF YOU
FORGET TO SHUT OFF YOUR MACHINES. Be very careful, and shut down your cluster as soon as
you are done working. You can always create a new one easily when you begin your work again.
4.2 Turnin
Create a single document that has results for all three tasks. Make sure to be very clear whether you
tried the big data or small data option. Turn in this document as well as all of your code. Please zip up all
of your code and your document (use .gz or .zip only, please!), or else attach each piece of code as well as
your document to your submission individually. Do NOT turn in anything other than your Python code and
your document.