Notes on a Deployment Plan (and Process)

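Preface

To simplify future crawler releases, I merged the crawler API project and the scheduled-task project into one. Many related services are involved, and they have to be modified and started in a specific order (otherwise a service that should already be running is not, and everything that depends on it breaks, which is awkward). So before the release I spent most of a morning writing up a deployment procedure that would give clear guidance during the release itself.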

Deployment steps

Package installation

Install the grequests and itemadapter packages.
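A quick way to confirm that this step actually took effect is to ask the target interpreter whether the packages can be found. The following is a minimal sketch using only the standard library; the function name `missing_packages` is hypothetical, and it only handles top-level package names.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The two packages named in this step.
REQUIRED = ["grequests", "itemadapter"]
```

Running `missing_packages(REQUIRED)` with the same interpreter that starts the services (here /usr/bin/python3.6.6/bin/python3) should return an empty list before any process is started.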

Process operations

Stopping processes

Kill all Python task processes on the server

ps -ef | grep _task| grep -v grep | awk '{print $2}' | xargs kill -9

Kill the Python callback processes on the server

ps -ef | grep _callback| grep -v grep | awk '{print $2}' | xargs kill -9

Kill the crawler API (gunicorn) processes on the server

ps -ef | grep gunicorn| grep -v grep | awk '{print $2}' | xargs kill -9

Kill the crawler celery processes on the server

ps -ef | grep _celery| grep -v grep | awk '{print $2}' | xargs kill -9
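The four pipelines above share one pattern: list processes, filter by a substring, drop the grep itself, and signal the PIDs. A rough Python equivalent is sketched below. The function names are hypothetical, and it deliberately defaults to SIGTERM rather than the `kill -9` used above, so that workers get a chance to shut down cleanly; pass `signal.SIGKILL` to match the original behaviour.

```python
import os
import signal
import subprocess


def matching_pids(ps_output, pattern):
    """Return PIDs from `ps -ef` output whose line contains `pattern`.

    Mirrors: grep pattern | grep -v grep | awk '{print $2}'
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # The PID is the second column; the header row is skipped because
        # its second column ("PID") is not numeric.
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        if pattern in line and "grep" not in line:
            pids.append(int(fields[1]))
    return pids


def stop_matching(pattern, sig=signal.SIGTERM):
    """Signal every process whose `ps -ef` line contains `pattern`."""
    result = subprocess.run(
        ["ps", "-ef"], stdout=subprocess.PIPE, universal_newlines=True, check=True
    )
    own_pid = os.getpid()  # never signal this script itself
    pids = [pid for pid in matching_pids(result.stdout, pattern) if pid != own_pid]
    for pid in pids:
        os.kill(pid, sig)
    return pids
```

For example, `stop_matching("_celery")` would correspond to the last pipeline above. Unlike `xargs kill`, it simply returns an empty list when nothing matches.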

Building the code

TODO

Starting processes

Start the crawler API process

First, create the log file: touch /var/log/gunicorn.log

Then change into the project directory: cd /data/wwwroot/spiderapi-line

Run: gunicorn -w 3 -b 0.0.0.0:5000 --threads 16 -k gevent -t 1000 --access-logfile /var/log/gunicorn.log -D api:app

Start the celery processes

nohup celery -A app.celeries.listing_celery worker -l info &
nohup celery -A app.celeries.comment_celery worker -l info &
nohup celery -A app.celeries.small_cate_celery worker -l info &
nohup celery -A app.celeries.keyword_celery worker -l info &
nohup celery -A app.celeries.qa_celery worker -l info &

Start the crawler process

Start the subcategory ranking spider: nohup scrapy crawl smallCategorySpider &

Start the task callback processes

nohup /usr/bin/python3.6.6/bin/python3 /data/wwwroot/spiderapi-line/shell.py -m comment -t callback &
nohup /usr/bin/python3.6.6/bin/python3 /data/wwwroot/spiderapi-line/shell.py -m listing -t callback &
nohup /usr/bin/python3.6.6/bin/python3 /data/wwwroot/spiderapi-line/shell.py -m small_cate -t callback &
nohup /usr/bin/python3.6.6/bin/python3 /data/wwwroot/spiderapi-line/shell.py -m keyword -t callback &
nohup /usr/bin/python3.6.6/bin/python3 /data/wwwroot/spiderapi-line/shell.py -m qa -t callback &

Database

Mongo

Create the spider_amz_qa database, add a user collection, and insert one document:

{
    "appid" : "ac539beebe55e46db51daacef575e336",
    "callback" : "http://cron.velocityecp.cn/api/spider/callback/qa",
    "uid" : "0"
}

Create the user database, add a user collection, and insert one document:

{
    "appid" : "ac539beebe55e46db51daacef575e336",
    "key" : "8aeb5d1b4e4484645f78ad30dddec6bc"
}
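If this step ever needs to be repeated, an upsert keeps it safe to re-run. The sketch below expresses the two documents above as upserts keyed on `appid`; using `appid` as the match key is an assumption, as are the names `seed_operations` and `apply_seed`. `apply_seed` expects a `pymongo.MongoClient` (or any object with the same indexing and `update_one` interface).

```python
# The two seed documents from this section: (database, collection, document).
SEED = [
    ("spider_amz_qa", "user", {
        "appid": "ac539beebe55e46db51daacef575e336",
        "callback": "http://cron.velocityecp.cn/api/spider/callback/qa",
        "uid": "0",
    }),
    ("user", "user", {
        "appid": "ac539beebe55e46db51daacef575e336",
        "key": "8aeb5d1b4e4484645f78ad30dddec6bc",
    }),
]


def seed_operations(seed=SEED):
    """Turn each seed document into (db, collection, filter, update)."""
    return [
        (db, coll, {"appid": doc["appid"]}, {"$set": doc})
        for db, coll, doc in seed
    ]


def apply_seed(client, operations):
    """Apply the upserts through a pymongo.MongoClient-compatible object."""
    for db, coll, flt, update in operations:
        # upsert=True inserts the document if no match exists, so re-running
        # the step does not create duplicates.
        client[db][coll].update_one(flt, update, upsert=True)
```

With pymongo installed, usage would look like `apply_seed(MongoClient("mongodb://localhost:27017"), seed_operations())`, where the connection URI is a placeholder.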

Supervisor service

New configuration

Add the ECP QA jobs configuration file, named velocityecp-spider-qa.conf:

[program:velocityecp-spider-qa-task]
process_name=%(program_name)s_%(process_num)02d
command=/usr/local/php/bin/php /data/wwwroot/ecpcron-line/artisan queue:work redis --queue=spider:qa:task --sleep=5 --tries=3
autostart=true
autorestart=true
user=root
numprocs=4
redirect_stderr=true
stdout_logfile=/var/log/supervisor/laravel-queue.log

Add the crawler QA configuration file, named spider-qa.conf:

[program:spider-qa-task]
process_name=%(program_name)s_%(process_num)02d
command=/usr/bin/python3.6.6/bin/scrapy crawl amazonQASpider -s JOBDIR=/var/log/crawler/comment_qa
directory=/data/wwwroot/spider-line/amazon_qa
autostart=true
autorestart=true
user=root
numprocs=3
redirect_stderr=true
stdout_logfile=/var/log/supervisor/spider-qa.log

Add the crawler QA question configuration file, named spider-qa-question.conf:

[program:spider-qa-question-task]
process_name=%(program_name)s_%(process_num)02d
command=/usr/bin/python3.6.6/bin/scrapy crawl amazonQAQuestionSpider -s JOBDIR=/var/log/crawler/comment_qa_question
directory=/data/wwwroot/spider-line/amazon_qa
autostart=true
autorestart=true
user=root
numprocs=4
redirect_stderr=true
stdout_logfile=/var/log/supervisor/spider-qa-question.log
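Hand-written supervisor files are easy to get subtly wrong, so a small pre-flight check can help. The sketch below parses a config with the standard `configparser` module and checks two things: that each `[program:x]` section has a `command`, and that `process_name` includes `%(process_num)` whenever `numprocs` is greater than 1, which supervisor requires. The function name is hypothetical and the checks are far from exhaustive.

```python
import configparser


def check_program_sections(conf_text):
    """Return a list of problems found in the [program:x] sections."""
    # interpolation=None keeps supervisor's %(program_name)s expressions
    # as literal text instead of letting configparser try to expand them.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    problems = []
    for section in parser.sections():
        if not section.startswith("program:"):
            continue
        opts = parser[section]
        if "command" not in opts:
            problems.append(section + ": missing command")
        numprocs = int(opts.get("numprocs", "1"))
        if numprocs > 1 and "%(process_num)" not in opts.get("process_name", ""):
            problems.append(section + ": numprocs > 1 needs %(process_num) in process_name")
    return problems
```

Feeding it the contents of each new .conf file before restarting supervisor should return an empty list for the three files above.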

Modified configuration

Update the directory setting in the following crawler configuration files:

spider-comment.conf directory=/data/wwwroot/spider-line/amazon_comment
spider-comment-detail.conf directory=/data/wwwroot/spider-line/amazon_comment
spider-keyword.conf directory=/data/wwwroot/spider-line/amazon_keyword
spider-listing.conf directory=/data/wwwroot/spider-line/amazon_listing

Restart supervisor

supervisorctl reload

supervisorctl status
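Reading the status output by eye works for a handful of programs, but with several multi-process groups it is easy to miss one that failed. A minimal sketch that picks out anything not in the RUNNING state from the text printed by `supervisorctl status` (the function name is hypothetical):

```python
def not_running(status_output):
    """Return (name, state) for every line whose state is not RUNNING.

    Each line of `supervisorctl status` starts with the process name
    followed by its state, for example:
        spider-qa-task:spider-qa-task_00   RUNNING   pid 1201, uptime 0:01:05
    """
    bad = []
    for line in status_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1] != "RUNNING":
            bad.append((parts[0], parts[1]))
    return bad
```

An empty result means every listed process is up; anything else names the processes to look at in the supervisor logs.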

Crontab setup

Add the Python crawler jobs with crontab -e:

# QA crawler job
10 0 * * * /usr/bin/python3.6.6/bin/python3 /data/wwwroot/spiderapi-line/shell.py -m qa -t cron >> /var/log/crontab.log 2>&1

# listing crawler job
0 1 * * * /usr/bin/python3.6.6/bin/python3 /data/wwwroot/spiderapi-line/shell.py -m listing -t cron >> /var/log/crontab.log 2>&1

# comment crawler job
30 4 * * 0,2,3,4,5,6 /usr/bin/python3.6.6/bin/python3 /data/wwwroot/spiderapi-line/shell.py -m comment -t cron >> /var/log/crontab.log 2>&1

# keyword crawler job
0 4 * * * /usr/bin/python3.6.6/bin/python3 /data/wwwroot/spiderapi-line/shell.py -m keyword -t cron >> /var/log/crontab.log 2>&1
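To double-check when each schedule fires, the five-field expressions can be evaluated against a given time. The sketch below supports only `*` and comma-separated integers, which are the only forms used in the entries above; it is not a full cron parser (no ranges, no steps, no `7` for Sunday, and it does not reproduce cron's special handling when both day-of-month and day-of-week are restricted).

```python
from datetime import datetime


def field_matches(field, value):
    """Match one cron field; supports '*' and comma-separated integers only."""
    if field == "*":
        return True
    return value in {int(part) for part in field.split(",")}


def cron_matches(expr, when):
    """Return True if the five-field cron expression fires at `when`."""
    minute, hour, dom, month, dow = expr.split()
    return (
        field_matches(minute, when.minute)
        and field_matches(hour, when.hour)
        and field_matches(dom, when.day)
        and field_matches(month, when.month)
        # cron counts Sunday as 0; isoweekday() counts Sunday as 7.
        and field_matches(dow, when.isoweekday() % 7)
    )
```

For instance, the comment job's `30 4 * * 0,2,3,4,5,6` matches 04:30 on every day except Monday, since day-of-week 1 is missing from the list.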

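Key points

The main changes in this release are:

The API and scheduled-task projects were merged into a single API project.
Crawler-related processes are now managed by the supervisor daemon.
Scheduled tasks moved from Python's apscheduler package to crontab, so they are all managed in one place.
Callback tasks now use the grequests package and are sent concurrently with coroutines, which improves sending throughput.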

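Summary

Following this procedure during the release made the order of operations obvious, which made the release much faster and less error-prone.
The services involved can tolerate a short pause, and the release took place during off-peak hours, so stopping some of them had little impact on the business. A scenario with stricter availability requirements would need a different plan for a smooth, zero-downtime rollout.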

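Follow-up

In the early hours I waited at home until the project's scheduled tasks kicked off, then checked the server and found that some services were reporting errors. It turned out that a package that should have been installed was missing. So even with a deployment document written in advance, things get overlooked. How much worse would it be with nothing written down, starting and stopping services from memory? Even more would be missed, and I would have gotten to bed even later.

The next day at the office I checked the other modules touched by the release and found no problems. It struck me that when a release is large and involved, writing a document to guide it is well worth the effort.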

Looking ahead

I hope to find time later to build some automation tools so that this tedious manual work can be avoided.