January 09, 2025

Crawling CSDN Geek Headlines with Python

Over the past two weeks I spent some time reading "Python Network Data Collection". It is a short book, under 200 pages, but a rich one: it covers getting started, more advanced techniques, pitfalls, practical experience, underlying principles, and analysis, and I learned a lot from it. The book spends a good deal of time on anti-crawler measures such as image CAPTCHAs, but CSDN is open enough that none of these get in the way. So my first exercise is to crawl the newly updated articles on CSDN's Geek Headlines.

1, ideas

The approach is simple: log in first, then crawl the page for the titles and links of the updated articles. One thing to note is that the Geek Headlines list refreshes dynamically: a new batch of articles is loaded only when the page has a scroll bar and it is pulled down. I tried it on a monitor in portrait orientation, where the default list of 20 articles fits without a scroll bar, and as a result no further articles could be loaded. That should probably be considered a bug.

2, preparation

Capturing the traffic with the browser's developer tools reveals the URL format Geek Headlines uses when it requests a new batch of articles:

http://geek.csdn.net/service/news/get_news_list?jsonpcallback=jQuery203014439105321047596_1516862462757&username=[account name]&from=-&size=20&type=hackernewsv2_new&_=1516862462758

Request parameters:

jsonpcallback: jQuery20302827217349787545_1516863701413  # The name of the anonymous callback function that jQuery generates automatically to process the data returned by the ajax call. Judging from the page source it uses getJSON, so the value is generated by the page and you can fill in anything.

username: [account name]

from: 6:252765  # The starting position for the next request of the article list. On the first request, fill in '-' (a short dash) as in the example above; the value for the next request is carried in the JSON data returned by the current one.

size: 20  # The number of article entries requested this time. I tried 1000 and it still succeeded.

type: hackernewsv2_new  # The article type, corresponding to the sub-headings on the home page such as "Hottest", "Newest", and "Industry". Each category uses a different value, which you can see in the packet capture.

_: 1516863701415  # Serves no real purpose. It is the number after the underscore in the first parameter, incremented; in testing, other values also work.
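The parameters above can be assembled into a request URL with the standard library. This is a minimal sketch, not the author's code: the function name `build_news_list_url`, the placeholder callback name `jQuery_cb`, and the account name `example_user` are assumptions made for illustration, while the endpoint and parameter names come from the captured URL.

```python
from urllib.parse import urlencode

NEWS_LIST_URL = "http://geek.csdn.net/service/news/get_news_list"

def build_news_list_url(username, start="-", size=20,
                        news_type="hackernewsv2_new",
                        callback="jQuery_cb", cache_buster=0):
    """Return the article-list URL for one batch of articles."""
    params = {
        "jsonpcallback": callback,  # arbitrary name, per the description above
        "username": username,
        "from": start,              # "-" first, then the value from the last response
        "size": size,
        "type": news_type,
        "_": cache_buster,          # reported above to have no real effect
    }
    return NEWS_LIST_URL + "?" + urlencode(params)

# "example_user" is a placeholder account name.
first_page_url = build_news_list_url("example_user")
print(first_page_url)
```

Note that `urlencode` percent-encodes the colon in a later `from` value such as `6:252765`; a server would normally decode it back, but that has not been checked against CSDN here.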

From reading up on it and capturing packets, I found that logging in to CSDN is still quite simple: a user name and password are enough, with no CAPTCHA or the like. The packet capture shows the request parameters:

gps: 39.890503, 116.431339

username: [account name]

password: [Password]  # Shown in plaintext in the packet capture; when actually sent over the network it should be encrypted.

rememberMe: true  # Whether to remember the password

lt: LT-448149-vgNusKFi3i7wBRIZUrzCFLDfoDVP34  # Found in the main login page, where you have to parse it out yourself. The value is random, so it must be fetched on every login.

execution: e3s1  # Appears to be a fixed value at present, but it differs from the value quoted in articles online, so it is still best to fetch it on every login.

_eventId: submit  # Fixed value, meaning "submit"
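Because `lt` and `execution` have to be read from the login page each time, one way to do it is to collect the page's hidden form inputs. The sketch below assumes those two values appear as `<input type="hidden">` elements, which the text above does not state explicitly; the sample HTML, the class `HiddenInputParser`, and the function `build_login_form` are hypothetical. The `gps` field from the capture is left out, on the unverified assumption that the server does not require it.

```python
from html.parser import HTMLParser

class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if attr.get("type") == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""

def build_login_form(login_page_html, username, password):
    """Combine the credentials with the per-login values from the page."""
    parser = HiddenInputParser()
    parser.feed(login_page_html)
    hidden = parser.fields
    return {
        "username": username,
        "password": password,
        "rememberMe": "true",
        "lt": hidden["lt"],                # random, changes on every login
        "execution": hidden["execution"],  # fetched rather than hard-coded
        "_eventId": "submit",
    }

# Hypothetical login-page fragment, used only to exercise the parser offline.
sample_page = """
<form method="post">
  <input type="hidden" name="lt" value="LT-448149-vgNusKFi3i7wBRIZUrzCFLDfoDVP34" />
  <input type="hidden" name="execution" value="e3s1" />
  <input type="hidden" name="_eventId" value="submit" />
</form>
"""
form = build_login_form(sample_page, "example_user", "example_password")
print(form)
```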

When logging in, note that as an anti-crawler measure CSDN requires the User-Agent field of the HTTP header to be a genuine browser value, so I used the value a real browser sent in the packet capture. Otherwise the login always fails and you are sent back to the login page.
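A minimal sketch of sending a browser User-Agent with the standard library follows. The User-Agent string is only an example of a desktop browser value, and the login URL `https://example.invalid/login` is a placeholder, since the real login address is not given in this post; copy both from your own packet capture. The cookie jar is an assumption too: it is the usual way to keep the session after login.

```python
import urllib.request
from http.cookiejar import CookieJar
from urllib.parse import urlencode

# Example value only; use the string your own browser sends.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
)

def make_opener():
    """Build an opener that keeps cookies and sends a browser User-Agent."""
    cookie_handler = urllib.request.HTTPCookieProcessor(CookieJar())
    opener = urllib.request.build_opener(cookie_handler)
    opener.addheaders = [("User-Agent", BROWSER_USER_AGENT)]
    return opener

def make_login_request(login_url, form):
    """Build the form POST; supplying data makes urllib use POST."""
    data = urlencode(form).encode("utf-8")
    return urllib.request.Request(login_url, data=data)

opener = make_opener()
request = make_login_request(
    "https://example.invalid/login",       # placeholder URL
    {"username": "u", "password": "p"},    # placeholder form
)
print(request.get_method(), opener.addheaders)
# To actually send it: response = opener.open(request)
```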

The packet capture also shows the JSON data returned after requesting articles. Its 'from' field is the value to use in the next request, and its 'html' field is the returned page fragment as a UTF-8 encoded Unicode string. Python handles the text as Unicode, so once the html field is extracted it is automatically converted to Chinese characters, symbols, and so on. Then parse the links whose class is 'title' to get each article's link and name.
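The parsing step can be sketched as follows, using only the standard library. The sample response is made up for illustration: it assumes the body is wrapped in the callback name in the usual JSONP way and that article links are `<a class="title">` elements; only the 'from' and 'html' field names come from the capture described above. `TitleLinkParser` and `parse_news_response` are hypothetical names.

```python
import json
import re
from html.parser import HTMLParser

class TitleLinkParser(HTMLParser):
    """Collect (text, href) for every <a> whose class list contains 'title'."""

    def __init__(self):
        super().__init__()
        self.articles = []
        self._in_title = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and "title" in (attr.get("class") or "").split():
            self._in_title = True
            self._href = attr.get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_title:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_title:
            self.articles.append(("".join(self._text).strip(), self._href))
            self._in_title = False

def parse_news_response(body):
    """Strip the JSONP wrapper; return (next_from, [(title, link), ...])."""
    match = re.search(r"\((.*)\)\s*;?\s*$", body, re.S)
    payload = json.loads(match.group(1) if match else body)
    parser = TitleLinkParser()
    parser.feed(payload["html"])
    return payload["from"], parser.articles

# Made-up response; the \uXXXX escapes are decoded by json.loads.
sample = (r'jQuery_cb({"from": "6:252765", "html": '
          r'"<a class=\"title\" href=\"http://example.invalid/a1\">'
          r'\u6781\u5ba2\u5934\u6761</a>"})')
next_from, articles = parse_news_response(sample)
print(next_from, articles)
```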

3, the code (very short)
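What follows is not the author's listing but a minimal sketch of the paging loop the sections above describe: start with '-' and feed each response's 'from' value into the next request. The `fetch_page` callable is hypothetical; in a real run it would request one batch over the logged-in session and return the next 'from' value together with the parsed (title, link) pairs. Here it is replaced by an offline stand-in so the loop can be run without network access.

```python
def crawl_headlines(fetch_page, max_pages=3):
    """Follow the 'from' cursor batch by batch and collect articles.

    fetch_page(start) is a hypothetical callable returning
    (next_from, [(title, link), ...]) for one request.
    """
    start = "-"  # the first request uses the short dash
    collected = []
    for _ in range(max_pages):
        next_from, articles = fetch_page(start)
        if not articles:
            break  # nothing more to load
        collected.extend(articles)
        if not next_from or next_from == start:
            break  # cursor did not advance
        start = next_from
    return collected

# Offline stand-in for the server, keyed by the 'from' value requested.
fake_pages = {
    "-": ("6:2", [("First", "http://example.invalid/1")]),
    "6:2": ("6:1", [("Second", "http://example.invalid/2")]),
    "6:1": ("6:0", []),
}
results = crawl_headlines(lambda start: fake_pages[start])
for title, link in results:
    print(title, link)
```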
