How To Bypass Bot Detection And Scrape A Website Using Python?

Asked one year ago

Web scraping is an effective way of gathering data from the Web. However, some pages protect themselves against bot traffic, mainly to keep out malicious traffic. A common way of doing so is to show a captcha when suspicious behaviour has been detected.

The question, then, is how a site can tell human traffic from bot traffic. Can we avoid being detected while scraping a site?

Note: This article assumes that the reader is familiar with browser sessions and cookies. Even if that is not the case, the article should still be useful. Also, this article was written in December 2022, and the technique for bypassing the detection may no longer work.

Set up

The Python version is 3.8.10. This article uses the following Python package:

selenium==4.7.2

We also need to download the Chrome webdriver (ChromeDriver) from its download page. Once downloaded, unzip the file and save it inside the project.

The issue

Let's try to scrape a page to see what the problem is. For this example, we will scrape Idealista, a Spanish real estate website. The full code is shown at the end, but the article goes step by step to make the problem easier to understand.

Let's say that we start with the following simple code, which just opens the page:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

URL = 'https://www.idealista.com/'
s = Service('../chromedriver_linux64/chromedriver.exe') # Change with your path
options = Options()
driver = webdriver.Chrome(service=s, options=options)
driver.get(URL)

Now we run the code with Python's -i flag so that the browser window is not closed when the script ends:

python -i my_file.py

A captcha with a slider appears. We complete it manually and see that we still can't access the site.

The text states that access has been blocked due to improper use, even though we have solved the captcha. This means that the site is detecting the bot.

The site tells us that this can happen because of clicking speed, something blocking the site's JavaScript, or a bot being on our same network.

In this case, we can see a note at the top of the browser (under the URL bar) saying that we are using automated software. This is due to a flag that tells the browser that the navigation is being done by a bot. To fix this, we will add a few options to the driver.

The solution

Note: The full code is shown in the next section.

To keep the site from blocking the bot and let us access the content, we will add a few options to the driver.

Most of the time, we browse with the window maximized. So, before making the request to the page, we will maximize the browser window:

options.add_argument("start-maximized")

Next, we need to get rid of the notice under the URL bar. To make it disappear, we add the following options:

options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])

With these two options we can now access the content once we clear the captcha.

With this problem solved, if we run the whole program again we will see that it still shows the captcha on every execution. The difference is that we now have access to a cookie that we can use in future executions to bypass the captcha and start scraping the page.

To save this cookie, we add the following import at the top of our code, as pickle will be the format in which we save the cookies from the browser:
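One way to check whether the options took effect is to ask the page itself. The sketch below assumes that a site may read the standard navigator.webdriver JavaScript property, which is only one of several possible signals; the article does not say which signal Idealista relies on. The helper name automation_flag_visible is hypothetical, and driver is assumed to be the Selenium driver created earlier.

```python
def automation_flag_visible(driver) -> bool:
    """Return True if the current page sees navigator.webdriver as true.

    `driver` is assumed to be a Selenium WebDriver; any object with an
    execute_script method works. The helper name is hypothetical.
    """
    # execute_script runs JavaScript in the current page and returns its result
    value = driver.execute_script("return navigator.webdriver")
    # The property can be true, false or undefined (None on the Python side)
    return bool(value)
```

In interactive mode this could be called as automation_flag_visible(driver) once the page has loaded. A True result would suggest that the browser is still advertising itself as automated.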

import pickle

Then, we run the Python script in interactive mode so that we can save the cookies with one line of code once the captcha is cleared:

# This will open the web browser where we will be clearing the captcha manually
python -i my_file.py

# Once the captcha is completed and we are seeing the page,
# we will execute the next python line in the command line.
# The three arrows (>>>) mean that we have entered in the python command line

>>> pickle.dump(driver.get_cookies(), open('cookies.pkl', 'wb'))
>>> exit()

If this is executed correctly, we will now have a "cookies.pkl" file containing a valid cookie.

Now we add a few lines to our code to load the cookies after opening the page for the first time.

Important: Cookies must be loaded after the driver is on the site; otherwise, we will get an InvalidCookieDomainException.
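The same step can be wrapped in a small function so that the file is always closed properly. This is a minimal sketch rather than part of the original code: the name save_cookies and its default path are assumptions, and driver is expected to be the Selenium driver from the script.

```python
import pickle


def save_cookies(driver, path="cookies.pkl"):
    """Write the browser's current cookies to `path` and return how many were saved.

    Hypothetical helper: `driver` only needs a get_cookies() method,
    which Selenium drivers provide.
    """
    cookies = driver.get_cookies()  # list of dicts, one per cookie
    # The context manager closes the file even if pickling fails
    with open(path, "wb") as f:
        pickle.dump(cookies, f)
    return len(cookies)
```

From the interactive prompt, save_cookies(driver) would then replace the pickle.dump line.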

cookies = pickle.load(open("cookies.pkl", "rb"))
for cookie in cookies:
    driver.add_cookie(cookie)

After this, we open the page again and we will have bypassed the captcha successfully!
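On a first run the cookie file does not exist yet, so the loading step can be made tolerant of that. The following is a sketch under that assumption; load_cookies is a hypothetical helper, and it must still be called only after driver.get(URL) has opened the site.

```python
import os
import pickle


def load_cookies(driver, path="cookies.pkl"):
    """Add the cookies stored in `path` to the driver and return how many were added.

    Hypothetical helper. Returns 0 when the file is missing, so the
    script can fall back to solving the captcha by hand.
    """
    if not os.path.exists(path):
        return 0
    with open(path, "rb") as f:
        cookies = pickle.load(f)
    for cookie in cookies:
        # The driver must already be on the cookie's domain at this point
        driver.add_cookie(cookie)
    return len(cookies)
```

A return value of 0 signals that the captcha has to be cleared manually and the cookies saved again.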

Full code

Putting all the previous steps together, our final code is the following:

import pickle

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

URL = 'https://www.idealista.com/'

# Options to avoid browser detection
options = Options()
options.add_argument("start-maximized")
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])

s = Service('../chromedriver_linux64/chromedriver.exe') # Put the path of your chromedriver
driver = webdriver.Chrome(service=s, options=options)

# We open the website before loading the cookies into the driver
# to avoid the InvalidCookieDomainException
driver.get(URL)

cookies = pickle.load(open("cookies.pkl", "rb"))

for cookie in cookies:
    driver.add_cookie(cookie)

# We get the URL again with the cookies loaded
driver.get(URL)

Summary and conclusions

In this article we have seen a workaround that lets us scrape a site by completing the captcha only once.

However, in the example shown, the saved cookie will expire (its lifespan is about 1 day) and we must then save it again manually, which could be a problem when automating and launching the process. Still, this approach lets the crawler work with minimal manual intervention.

Depending on the site, we could modify the cookie values so that they do not expire and reuse them in later executions. To do so, however, we would need a better understanding of how the site uses its cookies.
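Before launching the scraper, the saved cookies can be inspected to see whether any of them has already expired. The sketch below assumes the cookie dictionaries returned by Selenium's get_cookies(), where an optional "expiry" key holds the expiration time in seconds since the epoch. The function name expired_cookie_names is hypothetical, and which cookie actually carries the captcha clearance is not known from the article.

```python
import time


def expired_cookie_names(cookies, now=None):
    """Return the names of the cookies whose "expiry" time has passed.

    `cookies` is a list of dicts as returned by Selenium's get_cookies().
    Cookies without an "expiry" key (session cookies) are never reported.
    """
    now = time.time() if now is None else now
    expired = []
    for cookie in cookies:
        expiry = cookie.get("expiry")
        if expiry is not None and expiry <= now:
            expired.append(cookie.get("name", "<unnamed>"))
    return expired
```

If the returned list is not empty after loading "cookies.pkl", that is a hint that the captcha may need to be cleared and the cookies saved again.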

Answered one year ago by Wartian Herkku